Let's show a few convenient methods to deal with Missing Data in pandas:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A':[1,2,np.nan],
'B':[5,np.nan,np.nan],
'C':[1,2,3]})
df
| | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 1 |
| 1 | 2.0 | NaN | 2 |
| 2 | NaN | NaN | 3 |
# Treat zeros as missing by replacing them with NaN (this frame has none, so nothing changes)
df[df.eq(0)] = np.nan
df
| | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 1 |
| 1 | 2.0 | NaN | 2 |
| 2 | NaN | NaN | 3 |
df.dropna()
| | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 1 |
df.dropna(axis=1)
| | C |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
df.dropna(thresh=2)
| | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 1 |
| 1 | 2.0 | NaN | 2 |
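The thresh argument is easy to misread: it is the minimum number of non-missing values a row needs in order to be kept, not the number of missing values that are tolerated. A minimal sketch, rebuilding the same small frame under the illustrative name demo (the how and subset arguments are shown for comparison and are not used elsewhere in this notebook):

```python
import numpy as np
import pandas as pd

# Same toy frame as above, under a different (hypothetical) name
demo = pd.DataFrame({'A': [1, 2, np.nan],
                     'B': [5, np.nan, np.nan],
                     'C': [1, 2, 3]})

# Number of non-missing values in each row
non_missing = demo.notna().sum(axis=1)

# thresh=2 keeps rows with at least 2 non-missing values
kept = demo.dropna(thresh=2)

# how='all' only drops rows where every value is missing (none here)
kept_all = demo.dropna(how='all')

# subset restricts the check to the listed columns
kept_subset = demo.dropna(subset=['A'])
```

Row 2 has a single non-missing value, so it is the only row that thresh=2 removes.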
df.fillna(value='FILL VALUE')
| | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 1 |
| 1 | 2.0 | FILL VALUE | 2 |
| 2 | FILL VALUE | FILL VALUE | 3 |
df['A'].fillna(value=df['A'].mean())
0 1.0
1 2.0
2 1.5
Name: A, dtype: float64
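The same idea extends to several columns at once. As a sketch (the name demo is illustrative), passing the Series of column means to fillna fills each column with its own mean, and the original frame is left unchanged because fillna returns a new object:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'A': [1, 2, np.nan],
                     'B': [5, np.nan, np.nan],
                     'C': [1, 2, 3]})

# demo.mean() is a Series indexed by column name;
# fillna matches it to the columns and fills each one with its own mean
filled = demo.fillna(demo.mean())
```

Mean imputation is only one option; whether it is appropriate depends on why the values are missing.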
dummy_df = pd.read_csv('dummy_data.csv')
dummy_df
| | Sno | Name | Age | Height(cm) |
|---|---|---|---|---|
| 0 | 1 | John | 25.0 | 160.0 |
| 1 | 2 | Jimmy | 26.0 | 163.0 |
| 2 | 3 | Felicia | 28.0 | 154.0 |
| 3 | 4 | Sophia | NaN | 143.0 |
| 4 | 5 | Bob | NaN | NaN |
| 5 | 6 | Billy | 30.0 | 156.0 |
| 6 | 7 | Kate | 31.0 | 160.0 |
| 7 | 8 | Will | 29.0 | NaN |
| 8 | 9 | Scott | NaN | 148.0 |
There can be multiple reasons for missing values in a dataset.
dummy_df.describe()
| | Sno | Age | Height(cm) |
|---|---|---|---|
| count | 9.000000 | 6.000000 | 7.000000 |
| mean | 5.000000 | 28.166667 | 154.857143 |
| std | 2.738613 | 2.316607 | 7.174691 |
| min | 1.000000 | 25.000000 | 143.000000 |
| 25% | 3.000000 | 26.500000 | 151.000000 |
| 50% | 5.000000 | 28.500000 | 156.000000 |
| 75% | 7.000000 | 29.750000 | 160.000000 |
| max | 9.000000 | 31.000000 | 163.000000 |
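Here, count is the number of non-null values in each column, which is why Age and Height(cm) show fewer values than the number of rows: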
len(dummy_df)
9
dummy_df.isna()
| | Sno | Name | Age | Height(cm) |
|---|---|---|---|---|
| 0 | False | False | False | False |
| 1 | False | False | False | False |
| 2 | False | False | False | False |
| 3 | False | False | True | False |
| 4 | False | False | True | True |
| 5 | False | False | False | False |
| 6 | False | False | False | False |
| 7 | False | False | False | True |
| 8 | False | False | True | False |
dummy_df.notna()
| | Sno | Name | Age | Height(cm) |
|---|---|---|---|---|
| 0 | True | True | True | True |
| 1 | True | True | True | True |
| 2 | True | True | True | True |
| 3 | True | True | False | True |
| 4 | True | True | False | False |
| 5 | True | True | True | True |
| 6 | True | True | True | True |
| 7 | True | True | True | False |
| 8 | True | True | False | True |
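# isnull() is an alias of isna() and returns the same boolean table
dummy_df.isnull()
# Summing the booleans counts the missing values in each column
dummy_df.isnull().sum()
Sno 0
Name 0
Age 3
Height(cm) 2
dtype: int64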
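The relationship between len, count and isna can be checked directly. The sketch below uses a hand-typed copy of the Age and Height(cm) columns (so it does not depend on dummy_data.csv) and also computes the share of missing values per column; the variable names are illustrative:

```python
import numpy as np
import pandas as pd

# Hand-typed copy of the Age and Height(cm) columns shown above
ages = [25, 26, 28, np.nan, np.nan, 30, 31, 29, np.nan]
heights = [160, 163, 154, 143, np.nan, 156, 160, np.nan, 148]
sample = pd.DataFrame({'Age': ages, 'Height(cm)': heights})

n_rows = len(sample)                 # counts every row, missing or not
non_null = sample.count()            # non-null values per column
n_missing = sample.isna().sum()      # missing values per column
frac_missing = sample.isna().mean()  # share of missing values per column
```

For every column, non_null plus n_missing adds up to n_rows.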
string_dummy_df = pd.read_csv('dummy_str_data.csv', index_col=0)
string_dummy_df
| Sno | Device_name | Device_description | Single-Use |
|---|---|---|---|
| 1 | Synringe | Used to inject medicine | True |
| 2 | Ventilator | Used to help patients breath | False |
| 3 | Surgical Gloves | NaN | True |
| 4 | Stethescopes | NaN | NaN |
| 5 | Vials container | NaN | NaN |
time_df = pd.read_csv('dummy_time.csv', index_col=0)
time_df
| | Sno | Name | Age | Height(cm) | birthday |
|---|---|---|---|---|---|
| 0 | 1 | John | 25.0 | 160.0 | 1994-01-01 |
| 1 | 2 | Jimmy | 26.0 | 163.0 | NaN |
| 2 | 3 | Felicia | 28.0 | 154.0 | 1995-01-01 |
| 3 | 4 | Sophia | NaN | 143.0 | NaN |
| 4 | 5 | Bob | NaN | NaN | 1994-01-01 |
| 5 | 6 | Billy | 30.0 | 156.0 | 1994-01-01 |
| 6 | 7 | Kate | 31.0 | 160.0 | 1990-01-01 |
| 7 | 8 | Will | 29.0 | NaN | 1991-07-01 |
| 8 | 9 | Scott | NaN | 148.0 | NaN |
type(time_df['birthday'][0])
str
time_df['birthday'] = pd.to_datetime(time_df['birthday'])
time_df
| | Sno | Name | Age | Height(cm) | birthday |
|---|---|---|---|---|---|
| 0 | 1 | John | 25.0 | 160.0 | 1994-01-01 |
| 1 | 2 | Jimmy | 26.0 | 163.0 | NaT |
| 2 | 3 | Felicia | 28.0 | 154.0 | 1995-01-01 |
| 3 | 4 | Sophia | NaN | 143.0 | NaT |
| 4 | 5 | Bob | NaN | NaN | 1994-01-01 |
| 5 | 6 | Billy | 30.0 | 156.0 | 1994-01-01 |
| 6 | 7 | Kate | 31.0 | 160.0 | 1990-01-01 |
| 7 | 8 | Will | 29.0 | NaN | 1991-07-01 |
| 8 | 9 | Scott | NaN | 148.0 | NaT |
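After the conversion, missing birthdays appear as NaT rather than NaN. NaT is the datetime counterpart of NaN, and isna treats it as missing. A small sketch with made-up date strings (not the contents of dummy_time.csv):

```python
import pandas as pd

# Hypothetical birthday strings; None stands for a missing entry
raw = pd.Series(['1994-01-01', None, '1995-01-01'])

parsed = pd.to_datetime(raw)

# NaT is picked up by isna() just like NaN
missing_mask = parsed.isna()

# With a datetime dtype, the .dt accessor becomes available
years = parsed.dt.year
```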
DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use pandas to explore this topic!
import pandas as pd
import numpy as np
from numpy.random import randn
np.random.seed(101)
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
df
| | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
pd.DataFrame(randn(5,4))
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.302665 | 1.693723 | -1.706086 | -1.159119 |
| 1 | -0.134841 | 0.390528 | 0.166905 | 0.184502 |
| 2 | 0.807706 | 0.072960 | 0.638787 | 0.329646 |
| 3 | -0.497104 | -0.754070 | -0.943406 | 0.484752 |
| 4 | -0.116773 | 1.901755 | 0.238127 | 1.996652 |
df
| | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
Let's look at the various ways to select data from a DataFrame.
df['W']
A 2.706850
B 0.651118
C -2.018168
D 0.188695
E 0.190794
Name: W, dtype: float64
# Pass a list of column names
df[['W','Z']]
| | W | Z |
|---|---|---|
| A | 2.706850 | 0.503826 |
| B | 0.651118 | 0.605965 |
| C | -2.018168 | -0.589001 |
| D | 0.188695 | 0.955057 |
| E | 0.190794 | 0.683509 |
# Attribute-style access (NOT RECOMMENDED: column names can clash with DataFrame methods)
df.W
A 2.706850
B 0.651118
C -2.018168
D 0.188695
E 0.190794
Name: W, dtype: float64
DataFrame columns are just Series:
type(df['W'])
pandas.core.series.Series
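The picture of a DataFrame as several Series sharing one index can be made concrete. This is a sketch with two made-up Series (the names w, x and frame are illustrative):

```python
import pandas as pd

# Two hypothetical Series with the same index labels
w = pd.Series([1.0, 2.0, 3.0], index=['A', 'B', 'C'], name='W')
x = pd.Series([10.0, 20.0, 30.0], index=['A', 'B', 'C'], name='X')

# A dict of Series becomes a DataFrame; the keys become column names
frame = pd.DataFrame({'W': w, 'X': x})

# Pulling a column back out gives a Series again
col = frame['W']
```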
Creating a new column:
df['new'] = df['W'] + df['Y']
df
| | W | X | Y | Z | new |
|---|---|---|---|---|---|
| A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 3.614819 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | -0.196959 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | -1.489355 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | -0.744542 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 2.796762 |
Removing Columns
df.drop('new',axis=1)
| | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
# Not inplace unless specified!
df
| | W | X | Y | Z | new |
|---|---|---|---|---|---|
| A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 3.614819 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | -0.196959 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | -1.489355 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | -0.744542 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 2.796762 |
df.drop('new',axis=1,inplace=True)
df
| | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
Rows can be dropped the same way:
df.drop('E',axis=0)
| | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
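Because drop returns a new DataFrame, a common alternative to inplace=True is to assign the result to a name. The sketch below uses a small made-up frame and the columns= and index= keywords, which are equivalent to axis=1 and axis=0:

```python
import pandas as pd

frame = pd.DataFrame({'W': [1, 2], 'X': [3, 4], 'new': [4, 6]},
                     index=['A', 'B'])

# drop returns a new DataFrame; frame itself is untouched
without_new = frame.drop(columns='new')

# index= is the keyword counterpart of axis=0
without_b = frame.drop(index='B')
```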
Selecting Rows
df.loc['A']
W 2.706850
X 0.628133
Y 0.907969
Z 0.503826
Name: A, dtype: float64
df.loc['C']
W -2.018168
X 0.740122
Y 0.528813
Z -0.589001
Name: C, dtype: float64
Or select by position instead of label:
df.iloc[2] # same as the one above
W -2.018168
X 0.740122
Y 0.528813
Z -0.589001
Name: C, dtype: float64
Selecting subset of rows and columns
df.loc['B','Y']
-0.8480769834036315
df.loc[['A','B'],['W','Y']]
| | W | Y |
|---|---|---|
| A | 2.706850 | 0.907969 |
| B | 0.651118 | -0.848077 |
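One difference between loc and iloc that the cells above do not show is how slices behave: label slices include the end label, while position slices exclude the end position. A minimal sketch on a made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({'W': [1, 2, 3], 'Y': [4, 5, 6]},
                     index=['A', 'B', 'C'])

by_label = frame.loc['C']          # row labelled 'C'
by_position = frame.iloc[2]        # third row, whatever its label

# Label slices include the end label; position slices exclude the end
label_slice = frame.loc['A':'B']   # rows A and B
pos_slice = frame.iloc[0:1]        # row A only
```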
An important feature of pandas is conditional selection using bracket notation, very similar to numpy:
df
| | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
df>0
| | W | X | Y | Z |
|---|---|---|---|---|
| A | True | True | True | True |
| B | True | False | False | True |
| C | False | True | True | False |
| D | True | False | False | True |
| E | True | True | True | True |
df[df>0]
| | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | NaN | NaN | 0.605965 |
| C | NaN | 0.740122 | 0.528813 | NaN |
| D | 0.188695 | NaN | NaN | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
df[df['W']>0]
| | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
df[df['W']>0]['Y']
A 0.907969
B -0.848077
D -0.933237
E 2.605967
Name: Y, dtype: float64
df[df['W']>0][['Y','X']]
| | Y | X |
|---|---|---|
| A | 0.907969 | 0.628133 |
| B | -0.848077 | -0.319318 |
| D | -0.933237 | -0.758872 |
| E | 2.605967 | 1.978757 |
To combine two conditions, use | (or) and & (and), wrapping each condition in parentheses:
df[(df['W']>0) & (df['Y'] > 1)]
| | W | X | Y | Z |
|---|---|---|---|---|
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
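The reason for & and | rather than Python's and / or is that each comparison produces a whole boolean Series, which has no single truth value. The sketch below, on a small made-up frame, shows both operators, the error raised by and, and a .loc form that selects rows and columns in one step instead of chaining two brackets:

```python
import pandas as pd

frame = pd.DataFrame({'W': [2.7, 0.6, -2.0], 'Y': [0.9, -0.8, 0.5]},
                     index=['A', 'B', 'C'])

# Each comparison gives a boolean Series; & and | combine them element-wise
both = frame[(frame['W'] > 0) & (frame['Y'] > 0)]
either = frame[(frame['W'] > 0) | (frame['Y'] > 0)]

# Python's `and` asks for a single truth value, which a Series cannot give
try:
    frame[(frame['W'] > 0) and (frame['Y'] > 0)]
    raised = False
except ValueError:
    raised = True

# .loc takes the row mask and the column list in a single step
picked = frame.loc[frame['W'] > 0, ['Y', 'W']]
```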
Let's discuss some more features of indexing, including resetting the index or setting it to something else. We'll also talk about index hierarchy!
df
| | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
# Reset to default 0,1...n index
df.reset_index()
| | index | W | X | Y | Z |
|---|---|---|---|---|---|
| 0 | A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
| 1 | B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| 2 | C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| 3 | D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| 4 | E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
newind = 'CA NY WY OR CO'.split()
df['States'] = newind
df
| | W | X | Y | Z | States |
|---|---|---|---|---|---|
| A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | CA |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | NY |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | WY |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | OR |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | CO |
df.set_index('States')
| States | W | X | Y | Z |
|---|---|---|---|---|
| CA | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
| NY | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| WY | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| OR | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| CO | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
df
| | W | X | Y | Z | States |
|---|---|---|---|---|---|
| A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | CA |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | NY |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | WY |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | OR |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | CO |
df.set_index('States',inplace=True)
df
| States | W | X | Y | Z |
|---|---|---|---|---|
| CA | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
| NY | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| WY | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| OR | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| CO | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
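Note that set_index replaces the existing index, so the old A to E labels are gone unless they were saved first. A minimal sketch of the round trip on a made-up frame, including the drop=True option of reset_index:

```python
import pandas as pd

frame = pd.DataFrame({'W': [1, 2, 3]}, index=['A', 'B', 'C'])
frame['States'] = ['CA', 'NY', 'WY']

# set_index returns a new frame; the States column becomes the index
by_state = frame.set_index('States')

# reset_index moves the index back into a column
restored = by_state.reset_index()

# drop=True discards the old index instead of keeping it as a column
dropped = by_state.reset_index(drop=True)
```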
Let's go over how to work with a MultiIndex. First, we'll create a quick example of what a multi-indexed DataFrame looks like:
# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)
hier_index
MultiIndex([('G1', 1),
('G1', 2),
('G1', 3),
('G2', 1),
('G2', 2),
('G2', 3)],
)
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df
| | | A | B |
|---|---|---|---|
| G1 | 1 | -0.993263 | 0.196800 |
| | 2 | -1.136645 | 0.000366 |
| | 3 | 1.025984 | -0.156598 |
| G2 | 1 | -0.031579 | 0.649826 |
| | 2 | 2.154846 | -0.610259 |
| | 3 | -0.755325 | -0.346419 |
Now let's show how to index this! For an index hierarchy on the rows we use df.loc[]; if the hierarchy were on the columns axis, you would use normal bracket notation, df[]. Selecting one value of the outer level returns the sub-DataFrame:
df.loc['G1']
| | A | B |
|---|---|---|
| 1 | -0.993263 | 0.196800 |
| 2 | -1.136645 | 0.000366 |
| 3 | 1.025984 | -0.156598 |
df.loc['G1'].loc[1]
A -0.993263
B 0.196800
Name: 1, dtype: float64
df.index.names
FrozenList([None, None])
df.index.names = ['Group','Num']
df
| Group | Num | A | B |
|---|---|---|---|
| G1 | 1 | -0.993263 | 0.196800 |
| | 2 | -1.136645 | 0.000366 |
| | 3 | 1.025984 | -0.156598 |
| G2 | 1 | -0.031579 | 0.649826 |
| | 2 | 2.154846 | -0.610259 |
| | 3 | -0.755325 | -0.346419 |
df.xs('G1')
| Num | A | B |
|---|---|---|
| 1 | -0.993263 | 0.196800 |
| 2 | -1.136645 | 0.000366 |
| 3 | 1.025984 | -0.156598 |
# Pass the key as a tuple; recent pandas versions no longer accept a list here
df.xs(('G1',1))
A -0.993263
B 0.196800
Name: (G1, 1), dtype: float64
df.xs(1,level='Num')
| Group | A | B |
|---|---|---|
| G1 | -0.993263 | 0.196800 |
| G2 | -0.031579 | 0.649826 |
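To compare the access patterns side by side, here is a self-contained sketch. It builds the index with MultiIndex.from_product instead of from_tuples and fills the frame with consecutive integers, so the values differ from the random frame above; the name frame is illustrative:

```python
import numpy as np
import pandas as pd

# from_product builds every (group, number) pair
idx = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]],
                                 names=['Group', 'Num'])
frame = pd.DataFrame(np.arange(12).reshape(6, 2),
                     index=idx, columns=['A', 'B'])

# A tuple addresses one row across both levels
row = frame.loc[('G1', 1)]

# xs with a tuple key gives the same row
same_row = frame.xs(('G1', 1))

# xs with level= selects on the inner level across all groups
num_one = frame.xs(1, level='Num')
```

The level= form is where xs is most convenient, since it picks out an inner-level value without spelling out every outer label.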
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# First argument is mean
# Second argument is standard deviation
# Third argument is size
x = np.random.normal(50, 5, 3000)
x
array([45.27774055, 49.99545773, 54.67905363, ..., 45.46887562,
49.03996203, 44.89792021])
print("Max is {}".format(max(x)))
print("Min is {}".format(min(x)))
print("Max - Min is {}".format(max(x) - min(x)))
Max is 68.56357483500099
Min is 32.13656656048891
Max - Min is 36.42700827451208
sns.distplot(x, kde=False, bins=5,
hist_kws=dict(edgecolor="k", linewidth=2))
plt.xlabel('Intervals')
plt.ylabel('Frequency')
plt.show()
sns.distplot(x, kde=False, bins=10, hist_kws=dict(edgecolor="k", linewidth=2))
plt.xlabel('Intervals')
plt.ylabel('Frequency')
plt.show()
sns.distplot(x, kde=False, bins=15, hist_kws=dict(edgecolor="k", linewidth=2))
plt.xlabel('Intervals')
plt.ylabel('Frequency')
plt.show()
sns.distplot(x, kde=False, bins=50, hist_kws=dict(edgecolor="k", linewidth=2))
plt.xlabel('Intervals')
plt.ylabel('Frequency')
plt.show()
sns.distplot(x, kde=False, bins=30, hist_kws=dict(edgecolor="k", linewidth=2))
plt.xlabel('Intervals')
plt.ylabel('Frequency')
plt.show()
sns.distplot(x, kde=False, bins=30,
norm_hist=True,
hist_kws=dict(edgecolor="k", linewidth=2))
plt.xlabel('Intervals')
plt.ylabel('Density')
plt.show()
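The difference between a count histogram and a normalized one can be checked numerically without plotting. The sketch below uses np.histogram as a stand-in for what the plotting call computes; the seed, the generator and the variable names are assumptions made for repeatability, not part of the cells above:

```python
import numpy as np

rng = np.random.default_rng(0)        # seeded so the sketch is repeatable
sample = rng.normal(50, 5, 3000)      # mean 50, standard deviation 5

# Raw counts: the bar heights add up to the number of observations
counts, edges = np.histogram(sample, bins=30)

# density=True rescales the bars so that their total area is 1
density, _ = np.histogram(sample, bins=30, density=True)
widths = np.diff(edges)
total_area = (density * widths).sum()
```

With 30 bins there are 31 bin edges, and the density heights depend on the bin width, which is why they are not the same as relative frequencies.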
sns.distplot(x, kde=True, bins=30,
norm_hist=True,
hist_kws=dict(edgecolor="r", linewidth=2))
plt.xticks(range(28, 70, 4))
plt.ylabel('Density')
plt.show()
plt.axvline(x.mean(), color='b', linestyle='dashed', linewidth=1)
plt.axvline(np.median(x), color='r', linestyle='dashed', linewidth=4)
sns.distplot(x, kde=True, bins=30, hist=False)
plt.ylabel('Density')
plt.show()
np.median(x)
49.99594390121429
np.mean(x)
50.0234060825881
import numpy as np
import scipy.stats
import pandas as pd
dummy_age = [20, 21, 24, 24, 28, 26, 19, 22, 26, 24, 21,
19, 22, 28, 29, 6, 100, 25, 25, 28, 31]
dummy_height = [150, 151, 155, 153, 280, 160, 158, 157, 158, 145, 150,
155, 155, 151, 152, 153, 160, 152, 157, 157, 160, 153]
# zip stops at the shorter list, so the extra 22nd height value is dropped
dummy_df = pd.DataFrame(list(zip(dummy_age, dummy_height)),
columns =['Age', 'Height(cm)'])
dummy_df
| | Age | Height(cm) |
|---|---|---|
| 0 | 20 | 150 |
| 1 | 21 | 151 |
| 2 | 24 | 155 |
| 3 | 24 | 153 |
| 4 | 28 | 280 |
| 5 | 26 | 160 |
| 6 | 19 | 158 |
| 7 | 22 | 157 |
| 8 | 26 | 158 |
| 9 | 24 | 145 |
| 10 | 21 | 150 |
| 11 | 19 | 155 |
| 12 | 22 | 155 |
| 13 | 28 | 151 |
| 14 | 29 | 152 |
| 15 | 6 | 153 |
| 16 | 100 | 160 |
| 17 | 25 | 152 |
| 18 | 25 | 157 |
| 19 | 28 | 157 |
| 20 | 31 | 160 |
def modified_z_score(my_data):
    # my_data is expected to be a pandas Series (it uses .map)
    # First calculate the median
    median_my_data = np.median(my_data)
    # Median absolute deviation (MAD):
    # median of |X_i - median of X| over all X_i
    mad = np.median(my_data.map(lambda x: np.abs(x - median_my_data)))
    # Modified z-score:
    # 0.6745 * (X_i - median of X) / MAD
    scores = list(my_data.map(lambda x: 0.6745 * (x - median_my_data) / mad))
    return scores
modified_z_score(dummy_df['Age'])
[-0.8993333333333333, -0.6745, 0.0, 0.0, 0.8993333333333333, 0.44966666666666666, -1.1241666666666668, -0.44966666666666666, 0.44966666666666666, 0.0, -0.6745, -1.1241666666666668, -0.44966666666666666, 0.8993333333333333, 1.1241666666666668, -4.047, 17.087333333333333, 0.22483333333333333, 0.22483333333333333, 0.8993333333333333, 1.5738333333333332]
mod_z_score_age = modified_z_score(dummy_df['Age'])
dummy_df.iloc[np.where(np.abs(mod_z_score_age)>=3)]
| | Age | Height(cm) |
|---|---|---|
| 15 | 6 | 153 |
| 16 | 100 | 160 |
mod_z_score_height = modified_z_score(dummy_df['Height(cm)'])
dummy_df.iloc[np.where(np.abs(mod_z_score_height)>=3)]
| | Age | Height(cm) |
|---|---|---|
| 4 | 28 | 280 |
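The same calculation can be written with vectorized Series operations instead of map and lambda. This is one possible sketch, not the only way to do it; the function name modified_z and the guard against a zero MAD are additions for illustration, and the cutoff of 3 is the one used in the cells above:

```python
import numpy as np
import pandas as pd

def modified_z(values, scale=0.6745):
    """Modified z-score: scale * (x - median) / MAD, computed on the whole Series."""
    values = pd.Series(values, dtype=float)
    median = values.median()
    mad = (values - median).abs().median()   # median absolute deviation
    if mad == 0:
        # More than half the values equal the median; the score is undefined
        raise ValueError("MAD is zero, modified z-score is undefined")
    return scale * (values - median) / mad

ages = [20, 21, 24, 24, 28, 26, 19, 22, 26, 24, 21,
        19, 22, 28, 29, 6, 100, 25, 25, 28, 31]
scores = modified_z(ages)

# Same cutoff of 3 as in the cells above
outlier_positions = scores.index[scores.abs() >= 3].tolist()
```

For the age data the median is 24 and the MAD is 3, so the ages 6 and 100 at positions 15 and 16 are the only values whose absolute score reaches 3.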
This dataset contains information about passengers on the Titanic. Attributes such as age, sex, and class are recorded for each passenger, and the label 'Survived' indicates whether the passenger survived.
Survived: Outcome of survival (0 = No; 1 = Yes)
Pclass: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
Name: Name of passenger
Sex: Sex of the passenger
Age: Age of the passenger (Some entries contain NaN)
SibSp: Number of siblings and spouses of the passenger aboard
Parch: Number of parents and children of the passenger aboard
Ticket: Ticket number of the passenger
Fare: Fare paid by the passenger
Cabin: Cabin number of the passenger (Some entries contain NaN)
Embarked: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)
import pandas as pd
df = pd.read_csv('titanic-data.csv', index_col=0)
df.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
df.describe()
| | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
df.columns
Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket',
'Fare', 'Cabin', 'Embarked'],
dtype='object')
df['Age']
PassengerId
1 22.0
2 38.0
3 26.0
4 35.0
5 35.0
6 NaN
7 54.0
8 2.0
9 27.0
10 14.0
11 4.0
12 58.0
13 20.0
14 39.0
15 14.0
16 55.0
17 2.0
18 NaN
19 31.0
20 NaN
21 35.0
22 34.0
23 15.0
24 28.0
25 8.0
26 38.0
27 NaN
28 19.0
29 NaN
30 NaN
...
862 21.0
863 48.0
864 NaN
865 24.0
866 42.0
867 27.0
868 31.0
869 NaN
870 4.0
871 26.0
872 47.0
873 33.0
874 47.0
875 28.0
876 15.0
877 20.0
878 19.0
879 NaN
880 56.0
881 25.0
882 33.0
883 22.0
884 28.0
885 25.0
886 39.0
887 27.0
888 19.0
889 NaN
890 26.0
891 32.0
Name: Age, Length: 891, dtype: float64
# unique() is not very informative for a continuous column like Age
df['Age'].unique()
array([22. , 38. , 26. , 35. , nan, 54. , 2. , 27. , 14. ,
4. , 58. , 20. , 39. , 55. , 31. , 34. , 15. , 28. ,
8. , 19. , 40. , 66. , 42. , 21. , 18. , 3. , 7. ,
49. , 29. , 65. , 28.5 , 5. , 11. , 45. , 17. , 32. ,
16. , 25. , 0.83, 30. , 33. , 23. , 24. , 46. , 59. ,
71. , 37. , 47. , 14.5 , 70.5 , 32.5 , 12. , 9. , 36.5 ,
51. , 55.5 , 40.5 , 44. , 1. , 61. , 56. , 50. , 36. ,
45.5 , 20.5 , 62. , 41. , 52. , 63. , 23.5 , 0.92, 43. ,
60. , 10. , 64. , 13. , 48. , 0.75, 53. , 57. , 80. ,
70. , 24.5 , 6. , 0.67, 30.5 , 0.42, 34.5 , 74. ])
df['Fare']
PassengerId
1 7.2500
2 71.2833
3 7.9250
4 53.1000
5 8.0500
6 8.4583
7 51.8625
8 21.0750
9 11.1333
10 30.0708
11 16.7000
12 26.5500
13 8.0500
14 31.2750
15 7.8542
16 16.0000
17 29.1250
18 13.0000
19 18.0000
20 7.2250
21 26.0000
22 13.0000
23 8.0292
24 35.5000
25 21.0750
26 31.3875
27 7.2250
28 263.0000
29 7.8792
30 7.8958
...
862 11.5000
863 25.9292
864 69.5500
865 13.0000
866 13.0000
867 13.8583
868 50.4958
869 9.5000
870 11.1333
871 7.8958
872 52.5542
873 5.0000
874 9.0000
875 24.0000
876 7.2250
877 9.8458
878 7.8958
879 7.8958
880 83.1583
881 26.0000
882 7.8958
883 10.5167
884 10.5000
885 7.0500
886 29.1250
887 13.0000
888 30.0000
889 23.4500
890 30.0000
891 7.7500
Name: Fare, Length: 891, dtype: float64
df['SibSp']
PassengerId
1 1
2 1
3 0
4 1
5 0
6 0
7 0
8 3
9 0
10 1
11 1
12 0
13 0
14 1
15 0
16 0
17 4
18 0
19 1
20 0
21 0
22 0
23 0
24 0
25 3
26 1
27 0
28 3
29 0
30 0
..
862 1
863 0
864 8
865 0
866 0
867 1
868 0
869 0
870 1
871 0
872 1
873 0
874 0
875 1
876 0
877 0
878 0
879 0
880 0
881 0
882 0
883 0
884 0
885 0
886 0
887 0
888 0
889 1
890 0
891 0
Name: SibSp, Length: 891, dtype: int64
df['SibSp'].unique()
array([1, 0, 3, 4, 2, 5, 8])
df['Parch'].unique()
array([0, 1, 2, 5, 3, 4, 6])
df['Sex'].unique()
array(['male', 'female'], dtype=object)
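For low-cardinality columns such as Sex, SibSp or Parch, unique only lists the distinct values. When the frequencies matter as well, nunique and value_counts are natural companions. A sketch on a made-up Series (not the Titanic file):

```python
import pandas as pd

# Hypothetical miniature of a categorical column such as Sex or Embarked
col = pd.Series(['male', 'female', 'female', 'male', 'male', None])

n_distinct = col.nunique()                     # distinct non-missing values
counts = col.value_counts()                    # frequency of each value, most common first
with_missing = col.value_counts(dropna=False)  # also counts the missing entry
```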
# it does not make sense to use unique on Name
df['Name'].unique()
array(['Braund, Mr. Owen Harris',
'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
'Heikkinen, Miss. Laina',
'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
'Allen, Mr. William Henry', 'Moran, Mr. James',
'McCarthy, Mr. Timothy J', 'Palsson, Master. Gosta Leonard',
'Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)',
'Nasser, Mrs. Nicholas (Adele Achem)',
'Sandstrom, Miss. Marguerite Rut', 'Bonnell, Miss. Elizabeth',
'Saundercock, Mr. William Henry', 'Andersson, Mr. Anders Johan',
'Vestrom, Miss. Hulda Amanda Adolfina',
'Hewlett, Mrs. (Mary D Kingcome) ', 'Rice, Master. Eugene',
'Williams, Mr. Charles Eugene',
'Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)',
'Masselmani, Mrs. Fatima', 'Fynney, Mr. Joseph J',
'Beesley, Mr. Lawrence', 'McGowan, Miss. Anna "Annie"',
'Sloper, Mr. William Thompson', 'Palsson, Miss. Torborg Danira',
'Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson)',
'Emir, Mr. Farred Chehab', 'Fortune, Mr. Charles Alexander',
'O\'Dwyer, Miss. Ellen "Nellie"', 'Todoroff, Mr. Lalio',
'Uruchurtu, Don. Manuel E',
'Spencer, Mrs. William Augustus (Marie Eugenie)',
'Glynn, Miss. Mary Agatha', 'Wheadon, Mr. Edward H',
'Meyer, Mr. Edgar Joseph', 'Holverson, Mr. Alexander Oskar',
'Mamee, Mr. Hanna', 'Cann, Mr. Ernest Charles',
'Vander Planke, Miss. Augusta Maria',
'Nicola-Yarred, Miss. Jamila',
'Ahlin, Mrs. Johan (Johanna Persdotter Larsson)',
'Turpin, Mrs. William John Robert (Dorothy Ann Wonnacott)',
'Kraeff, Mr. Theodor', 'Laroche, Miss. Simonne Marie Anne Andree',
'Devaney, Miss. Margaret Delia', 'Rogers, Mr. William John',
'Lennon, Mr. Denis', "O'Driscoll, Miss. Bridget",
'Samaan, Mr. Youssef',
'Arnold-Franchi, Mrs. Josef (Josefine Franchi)',
'Panula, Master. Juha Niilo', 'Nosworthy, Mr. Richard Cater',
'Harper, Mrs. Henry Sleeper (Myna Haxtun)',
'Faunthorpe, Mrs. Lizzie (Elizabeth Anne Wilkinson)',
'Ostby, Mr. Engelhart Cornelius', 'Woolner, Mr. Hugh',
'Rugg, Miss. Emily', 'Novel, Mr. Mansouer',
'West, Miss. Constance Mirium',
'Goodwin, Master. William Frederick', 'Sirayanian, Mr. Orsen',
'Icard, Miss. Amelie', 'Harris, Mr. Henry Birkhardt',
'Skoog, Master. Harald', 'Stewart, Mr. Albert A',
'Moubarek, Master. Gerios', 'Nye, Mrs. (Elizabeth Ramell)',
'Crease, Mr. Ernest James', 'Andersson, Miss. Erna Alexandra',
'Kink, Mr. Vincenz', 'Jenkin, Mr. Stephen Curnow',
'Goodwin, Miss. Lillian Amy', 'Hood, Mr. Ambrose Jr',
'Chronopoulos, Mr. Apostolos', 'Bing, Mr. Lee',
'Moen, Mr. Sigurd Hansen', 'Staneff, Mr. Ivan',
'Moutal, Mr. Rahamin Haim', 'Caldwell, Master. Alden Gates',
'Dowdell, Miss. Elizabeth', 'Waelens, Mr. Achille',
'Sheerlinck, Mr. Jan Baptist', 'McDermott, Miss. Brigdet Delia',
'Carrau, Mr. Francisco M', 'Ilett, Miss. Bertha',
'Backstrom, Mrs. Karl Alfred (Maria Mathilda Gustafsson)',
'Ford, Mr. William Neal', 'Slocovski, Mr. Selman Francis',
'Fortune, Miss. Mabel Helen', 'Celotti, Mr. Francesco',
'Christmann, Mr. Emil', 'Andreasson, Mr. Paul Edvin',
'Chaffee, Mr. Herbert Fuller', 'Dean, Mr. Bertram Frank',
'Coxon, Mr. Daniel', 'Shorney, Mr. Charles Joseph',
'Goldschmidt, Mr. George B', 'Greenfield, Mr. William Bertram',
'Doling, Mrs. John T (Ada Julia Bone)', 'Kantor, Mr. Sinai',
'Petranec, Miss. Matilda', 'Petroff, Mr. Pastcho ("Pentcho")',
'White, Mr. Richard Frasar', 'Johansson, Mr. Gustaf Joel',
'Gustafsson, Mr. Anders Vilhelm', 'Mionoff, Mr. Stoytcho',
'Salkjelsvik, Miss. Anna Kristine', 'Moss, Mr. Albert Johan',
'Rekic, Mr. Tido', 'Moran, Miss. Bertha',
'Porter, Mr. Walter Chamberlain', 'Zabour, Miss. Hileni',
'Barton, Mr. David John', 'Jussila, Miss. Katriina',
'Attalah, Miss. Malake', 'Pekoniemi, Mr. Edvard',
'Connors, Mr. Patrick', 'Turpin, Mr. William John Robert',
'Baxter, Mr. Quigg Edmond', 'Andersson, Miss. Ellis Anna Maria',
'Hickman, Mr. Stanley George', 'Moore, Mr. Leonard Charles',
'Nasser, Mr. Nicholas', 'Webber, Miss. Susan',
'White, Mr. Percival Wayland', 'Nicola-Yarred, Master. Elias',
'McMahon, Mr. Martin', 'Madsen, Mr. Fridtjof Arne',
'Peter, Miss. Anna', 'Ekstrom, Mr. Johan', 'Drazenoic, Mr. Jozef',
'Coelho, Mr. Domingos Fernandeo',
'Robins, Mrs. Alexander A (Grace Charity Laury)',
'Weisz, Mrs. Leopold (Mathilde Francoise Pede)',
'Sobey, Mr. Samuel James Hayden', 'Richard, Mr. Emile',
'Newsom, Miss. Helen Monypeny', 'Futrelle, Mr. Jacques Heath',
'Osen, Mr. Olaf Elon', 'Giglio, Mr. Victor',
'Boulos, Mrs. Joseph (Sultana)', 'Nysten, Miss. Anna Sofia',
'Hakkarainen, Mrs. Pekka Pietari (Elin Matilda Dolck)',
'Burke, Mr. Jeremiah', 'Andrew, Mr. Edgardo Samuel',
'Nicholls, Mr. Joseph Charles',
'Andersson, Mr. August Edvard ("Wennerstrom")',
'Ford, Miss. Robina Maggie "Ruby"',
'Navratil, Mr. Michel ("Louis M Hoffman")',
'Byles, Rev. Thomas Roussel Davids', 'Bateman, Rev. Robert James',
'Pears, Mrs. Thomas (Edith Wearne)', 'Meo, Mr. Alfonzo',
'van Billiard, Mr. Austin Blyler', 'Olsen, Mr. Ole Martin',
'Williams, Mr. Charles Duane', 'Gilnagh, Miss. Katherine "Katie"',
'Corn, Mr. Harry', 'Smiljanic, Mr. Mile',
'Sage, Master. Thomas Henry', 'Cribb, Mr. John Hatfield',
'Watt, Mrs. James (Elizabeth "Bessie" Inglis Milne)',
'Bengtsson, Mr. John Viktor', 'Calic, Mr. Jovo',
'Panula, Master. Eino Viljami',
'Goldsmith, Master. Frank John William "Frankie"',
'Chibnall, Mrs. (Edith Martha Bowerman)',
'Skoog, Mrs. William (Anna Bernhardina Karlsson)',
'Baumann, Mr. John D', 'Ling, Mr. Lee',
'Van der hoef, Mr. Wyckoff', 'Rice, Master. Arthur',
'Johnson, Miss. Eleanor Ileen', 'Sivola, Mr. Antti Wilhelm',
'Smith, Mr. James Clinch', 'Klasen, Mr. Klas Albin',
'Lefebre, Master. Henry Forbes', 'Isham, Miss. Ann Elizabeth',
'Hale, Mr. Reginald', 'Leonard, Mr. Lionel',
'Sage, Miss. Constance Gladys', 'Pernot, Mr. Rene',
'Asplund, Master. Clarence Gustaf Hugo',
'Becker, Master. Richard F', 'Kink-Heilmann, Miss. Luise Gretchen',
'Rood, Mr. Hugh Roscoe',
'O\'Brien, Mrs. Thomas (Johanna "Hannah" Godfrey)',
'Romaine, Mr. Charles Hallace ("Mr C Rolmane")',
'Bourke, Mr. John', 'Turcin, Mr. Stjepan', 'Pinsky, Mrs. (Rosa)',
'Carbines, Mr. William',
'Andersen-Jensen, Miss. Carla Christine Nielsine',
'Navratil, Master. Michel M',
'Brown, Mrs. James Joseph (Margaret Tobin)',
'Lurette, Miss. Elise', 'Mernagh, Mr. Robert',
'Olsen, Mr. Karl Siegwart Andreas',
'Madigan, Miss. Margaret "Maggie"',
'Yrois, Miss. Henriette ("Mrs Harbeck")',
'Vande Walle, Mr. Nestor Cyriel', 'Sage, Mr. Frederick',
'Johanson, Mr. Jakob Alfred', 'Youseff, Mr. Gerious',
'Cohen, Mr. Gurshon "Gus"', 'Strom, Miss. Telma Matilda',
'Backstrom, Mr. Karl Alfred', 'Albimona, Mr. Nassef Cassem',
'Carr, Miss. Helen "Ellen"', 'Blank, Mr. Henry', 'Ali, Mr. Ahmed',
'Cameron, Miss. Clear Annie', 'Perkin, Mr. John Henry',
'Givard, Mr. Hans Kristensen', 'Kiernan, Mr. Philip',
'Newell, Miss. Madeleine', 'Honkanen, Miss. Eliina',
'Jacobsohn, Mr. Sidney Samuel', 'Bazzani, Miss. Albina',
'Harris, Mr. Walter', 'Sunderland, Mr. Victor Francis',
'Bracken, Mr. James H', 'Green, Mr. George Henry',
'Nenkoff, Mr. Christo', 'Hoyt, Mr. Frederick Maxfield',
'Berglund, Mr. Karl Ivar Sven', 'Mellors, Mr. William John',
'Lovell, Mr. John Hall ("Henry")', 'Fahlstrom, Mr. Arne Jonas',
'Lefebre, Miss. Mathilde',
'Harris, Mrs. Henry Birkhardt (Irene Wallach)',
'Larsson, Mr. Bengt Edvin', 'Sjostedt, Mr. Ernst Adolf',
'Asplund, Miss. Lillian Gertrud',
'Leyson, Mr. Robert William Norman',
'Harknett, Miss. Alice Phoebe', 'Hold, Mr. Stephen',
'Collyer, Miss. Marjorie "Lottie"',
'Pengelly, Mr. Frederick William', 'Hunt, Mr. George Henry',
'Zabour, Miss. Thamine', 'Murphy, Miss. Katherine "Kate"',
'Coleridge, Mr. Reginald Charles', 'Maenpaa, Mr. Matti Alexanteri',
'Attalah, Mr. Sleiman', 'Minahan, Dr. William Edward',
'Lindahl, Miss. Agda Thorilda Viktoria',
'Hamalainen, Mrs. William (Anna)', 'Beckwith, Mr. Richard Leonard',
'Carter, Rev. Ernest Courtenay', 'Reed, Mr. James George',
'Strom, Mrs. Wilhelm (Elna Matilda Persson)',
'Stead, Mr. William Thomas', 'Lobb, Mr. William Arthur',
'Rosblom, Mrs. Viktor (Helena Wilhelmina)',
'Touma, Mrs. Darwis (Hanne Youssef Razi)',
'Thorne, Mrs. Gertrude Maybelle', 'Cherry, Miss. Gladys',
'Ward, Miss. Anna', 'Parrish, Mrs. (Lutie Davis)',
'Smith, Mr. Thomas', 'Asplund, Master. Edvin Rojj Felix',
'Taussig, Mr. Emil', 'Harrison, Mr. William', 'Henry, Miss. Delia',
'Reeves, Mr. David', 'Panula, Mr. Ernesti Arvid',
'Persson, Mr. Ernst Ulrik',
'Graham, Mrs. William Thompson (Edith Junkins)',
'Bissette, Miss. Amelia', 'Cairns, Mr. Alexander',
'Tornquist, Mr. William Henry',
'Mellinger, Mrs. (Elizabeth Anne Maidment)',
'Natsch, Mr. Charles H', 'Healy, Miss. Hanora "Nora"',
'Andrews, Miss. Kornelia Theodosia',
'Lindblom, Miss. Augusta Charlotta', 'Parkes, Mr. Francis "Frank"',
'Rice, Master. Eric', 'Abbott, Mrs. Stanton (Rosa Hunt)',
'Duane, Mr. Frank', 'Olsson, Mr. Nils Johan Goransson',
'de Pelsmaeker, Mr. Alfons', 'Dorking, Mr. Edward Arthur',
'Smith, Mr. Richard William', 'Stankovic, Mr. Ivan',
'de Mulder, Mr. Theodore', 'Naidenoff, Mr. Penko',
'Hosono, Mr. Masabumi', 'Connolly, Miss. Kate',
'Barber, Miss. Ellen "Nellie"',
'Bishop, Mrs. Dickinson H (Helen Walton)',
'Levy, Mr. Rene Jacques', 'Haas, Miss. Aloisia',
'Mineff, Mr. Ivan', 'Lewy, Mr. Ervin G', 'Hanna, Mr. Mansour',
'Allison, Miss. Helen Loraine', 'Saalfeld, Mr. Adolphe',
'Baxter, Mrs. James (Helene DeLaudeniere Chaput)',
'Kelly, Miss. Anna Katherine "Annie Kate"', 'McCoy, Mr. Bernard',
'Johnson, Mr. William Cahoone Jr', 'Keane, Miss. Nora A',
'Williams, Mr. Howard Hugh "Harry"',
'Allison, Master. Hudson Trevor', 'Fleming, Miss. Margaret',
'Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallejo)',
'Abelson, Mr. Samuel', 'Francatelli, Miss. Laura Mabel',
'Hays, Miss. Margaret Bechstein', 'Ryerson, Miss. Emily Borie',
'Lahtinen, Mrs. William (Anna Sylfven)', 'Hendekovic, Mr. Ignjac',
'Hart, Mr. Benjamin', 'Nilsson, Miss. Helmina Josefina',
'Kantor, Mrs. Sinai (Miriam Sternin)', 'Moraweck, Dr. Ernest',
'Wick, Miss. Mary Natalie',
'Spedden, Mrs. Frederic Oakley (Margaretta Corning Stone)',
'Dennis, Mr. Samuel', 'Danoff, Mr. Yoto',
'Slayter, Miss. Hilda Mary',
'Caldwell, Mrs. Albert Francis (Sylvia Mae Harbaugh)',
'Sage, Mr. George John Jr', 'Young, Miss. Marie Grice',
'Nysveen, Mr. Johan Hansen', 'Ball, Mrs. (Ada E Hall)',
'Goldsmith, Mrs. Frank John (Emily Alice Brown)',
'Hippach, Miss. Jean Gertrude', 'McCoy, Miss. Agnes',
'Partner, Mr. Austen', 'Graham, Mr. George Edward',
'Vander Planke, Mr. Leo Edmondus',
'Frauenthal, Mrs. Henry William (Clara Heinsheimer)',
'Denkoff, Mr. Mitto', 'Pears, Mr. Thomas Clinton',
'Burns, Miss. Elizabeth Margaret', 'Dahl, Mr. Karl Edwart',
'Blackwell, Mr. Stephen Weart', 'Navratil, Master. Edmond Roger',
'Fortune, Miss. Alice Elizabeth', 'Collander, Mr. Erik Gustaf',
'Sedgwick, Mr. Charles Frederick Waddington',
'Fox, Mr. Stanley Hubert', 'Brown, Miss. Amelia "Mildred"',
'Smith, Miss. Marion Elsie',
'Davison, Mrs. Thomas Henry (Mary E Finck)',
'Coutts, Master. William Loch "William"', 'Dimic, Mr. Jovan',
'Odahl, Mr. Nils Martin', 'Williams-Lambert, Mr. Fletcher Fellows',
'Elias, Mr. Tannous', 'Arnold-Franchi, Mr. Josef',
'Yousif, Mr. Wazli', 'Vanden Steen, Mr. Leo Peter',
'Bowerman, Miss. Elsie Edith', 'Funk, Miss. Annie Clemmer',
'McGovern, Miss. Mary', 'Mockler, Miss. Helen Mary "Ellie"',
'Skoog, Mr. Wilhelm', 'del Carlo, Mr. Sebastiano',
'Barbara, Mrs. (Catherine David)', 'Asim, Mr. Adola',
"O'Brien, Mr. Thomas", 'Adahl, Mr. Mauritz Nils Martin',
'Warren, Mrs. Frank Manley (Anna Sophia Atkinson)',
'Moussa, Mrs. (Mantoura Boulos)', 'Jermyn, Miss. Annie',
'Aubart, Mme. Leontine Pauline', 'Harder, Mr. George Achilles',
'Wiklund, Mr. Jakob Alfred', 'Beavan, Mr. William Thomas',
'Ringhini, Mr. Sante', 'Palsson, Miss. Stina Viola',
'Meyer, Mrs. Edgar Joseph (Leila Saks)',
'Landergren, Miss. Aurora Adelia', 'Widener, Mr. Harry Elkins',
'Betros, Mr. Tannous', 'Gustafsson, Mr. Karl Gideon',
'Bidois, Miss. Rosalie', 'Nakid, Miss. Maria ("Mary")',
'Tikkanen, Mr. Juho',
'Holverson, Mrs. Alexander Oskar (Mary Aline Towner)',
'Plotcharsky, Mr. Vasil', 'Davies, Mr. Charles Henry',
'Goodwin, Master. Sidney Leonard', 'Buss, Miss. Kate',
'Sadlier, Mr. Matthew', 'Lehmann, Miss. Bertha',
'Carter, Mr. William Ernest', 'Jansson, Mr. Carl Olof',
'Gustafsson, Mr. Johan Birger', 'Newell, Miss. Marjorie',
'Sandstrom, Mrs. Hjalmar (Agnes Charlotta Bengtsson)',
'Johansson, Mr. Erik', 'Olsson, Miss. Elina',
'McKane, Mr. Peter David', 'Pain, Dr. Alfred',
'Trout, Mrs. William H (Jessie L)', 'Niskanen, Mr. Juha',
'Adams, Mr. John', 'Jussila, Miss. Mari Aina',
'Hakkarainen, Mr. Pekka Pietari', 'Oreskovic, Miss. Marija',
'Gale, Mr. Shadrach', 'Widegren, Mr. Carl/Charles Peter',
'Richards, Master. William Rowe',
'Birkeland, Mr. Hans Martin Monsen', 'Lefebre, Miss. Ida',
'Sdycoff, Mr. Todor', 'Hart, Mr. Henry', 'Minahan, Miss. Daisy E',
'Cunningham, Mr. Alfred Fleming', 'Sundman, Mr. Johan Julian',
'Meek, Mrs. Thomas (Annie Louise Rowley)',
'Drew, Mrs. James Vivian (Lulu Thorne Christian)',
'Silven, Miss. Lyyli Karoliina', 'Matthews, Mr. William John',
'Van Impe, Miss. Catharina', 'Gheorgheff, Mr. Stanio',
'Charters, Mr. David', 'Zimmerman, Mr. Leo',
'Danbom, Mrs. Ernst Gilbert (Anna Sigrid Maria Brogren)',
'Rosblom, Mr. Viktor Richard', 'Wiseman, Mr. Phillippe',
'Clarke, Mrs. Charles V (Ada Maria Winfield)',
'Phillips, Miss. Kate Florence ("Mrs Kate Louise Phillips Marshall")',
'Flynn, Mr. James', 'Pickard, Mr. Berk (Berk Trembisky)',
'Bjornstrom-Steffansson, Mr. Mauritz Hakan',
'Thorneycroft, Mrs. Percival (Florence Kate White)',
'Louch, Mrs. Charles Alexander (Alice Adelaide Slow)',
'Kallio, Mr. Nikolai Erland', 'Silvey, Mr. William Baird',
'Carter, Miss. Lucile Polk',
'Ford, Miss. Doolina Margaret "Daisy"',
'Richards, Mrs. Sidney (Emily Hocking)', 'Fortune, Mr. Mark',
'Kvillner, Mr. Johan Henrik Johannesson',
'Hart, Mrs. Benjamin (Esther Ada Bloomfield)', 'Hampe, Mr. Leon',
'Petterson, Mr. Johan Emil', 'Reynaldo, Ms. Encarnacion',
'Johannesen-Bratthammer, Mr. Bernt', 'Dodge, Master. Washington',
'Mellinger, Miss. Madeleine Violet', 'Seward, Mr. Frederic Kimber',
'Baclini, Miss. Marie Catherine', 'Peuchen, Major. Arthur Godfrey',
'West, Mr. Edwy Arthur', 'Hagland, Mr. Ingvald Olai Olsen',
'Foreman, Mr. Benjamin Laventall', 'Goldenberg, Mr. Samuel L',
'Peduzzi, Mr. Joseph', 'Jalsevac, Mr. Ivan',
'Millet, Mr. Francis Davis', 'Kenyon, Mrs. Frederick R (Marion)',
'Toomey, Miss. Ellen', "O'Connor, Mr. Maurice",
'Anderson, Mr. Harry', 'Morley, Mr. William', 'Gee, Mr. Arthur H',
'Milling, Mr. Jacob Christian', 'Maisner, Mr. Simon',
'Goncalves, Mr. Manuel Estanslas', 'Campbell, Mr. William',
'Smart, Mr. John Montgomery', 'Scanlan, Mr. James',
'Baclini, Miss. Helene Barbara', 'Keefe, Mr. Arthur',
'Cacic, Mr. Luka', 'West, Mrs. Edwy Arthur (Ada Mary Worth)',
'Jerwan, Mrs. Amin S (Marie Marthe Thuillard)',
'Strandberg, Miss. Ida Sofia', 'Clifford, Mr. George Quincy',
'Renouf, Mr. Peter Henry', 'Braund, Mr. Lewis Richard',
'Karlsson, Mr. Nils August', 'Hirvonen, Miss. Hildur E',
'Goodwin, Master. Harold Victor',
'Frost, Mr. Anthony Wood "Archie"', 'Rouse, Mr. Richard Henry',
'Turkula, Mrs. (Hedwig)', 'Bishop, Mr. Dickinson H',
'Lefebre, Miss. Jeannie',
'Hoyt, Mrs. Frederick Maxfield (Jane Anne Forby)',
'Kent, Mr. Edward Austin', 'Somerton, Mr. Francis William',
'Coutts, Master. Eden Leslie "Neville"',
'Hagland, Mr. Konrad Mathias Reiersen', 'Windelov, Mr. Einar',
'Molson, Mr. Harry Markland', 'Artagaveytia, Mr. Ramon',
'Stanley, Mr. Edward Roland', 'Yousseff, Mr. Gerious',
'Eustis, Miss. Elizabeth Mussey',
'Shellard, Mr. Frederick William',
'Allison, Mrs. Hudson J C (Bessie Waldo Daniels)',
'Svensson, Mr. Olof', 'Calic, Mr. Petar', 'Canavan, Miss. Mary',
"O'Sullivan, Miss. Bridget Mary", 'Laitinen, Miss. Kristina Sofia',
'Maioni, Miss. Roberta',
'Penasco y Castellana, Mr. Victor de Satode',
'Quick, Mrs. Frederick Charles (Jane Richards)',
'Bradley, Mr. George ("George Arthur Brayton")',
'Olsen, Mr. Henry Margido', 'Lang, Mr. Fang',
'Daly, Mr. Eugene Patrick', 'Webber, Mr. James',
'McGough, Mr. James Robert',
'Rothschild, Mrs. Martin (Elizabeth L. Barrett)',
'Coleff, Mr. Satio', 'Walker, Mr. William Anderson',
'Lemore, Mrs. (Amelia Milley)', 'Ryan, Mr. Patrick',
'Angle, Mrs. William A (Florence "Mary" Agnes Hughes)',
'Pavlovic, Mr. Stefo', 'Perreault, Miss. Anne', 'Vovk, Mr. Janko',
'Lahoud, Mr. Sarkis',
'Hippach, Mrs. Louis Albert (Ida Sophia Fischer)',
'Kassem, Mr. Fared', 'Farrell, Mr. James', 'Ridsdale, Miss. Lucy',
'Farthing, Mr. John', 'Salonen, Mr. Johan Werner',
'Hocking, Mr. Richard George', 'Quick, Miss. Phyllis May',
'Toufik, Mr. Nakli', 'Elias, Mr. Joseph Jr',
'Peter, Mrs. Catherine (Catherine Rizk)', 'Cacic, Miss. Marija',
'Hart, Miss. Eva Miriam', 'Butt, Major. Archibald Willingham',
'LeRoy, Miss. Bertha', 'Risien, Mr. Samuel Beard',
'Frolicher, Miss. Hedwig Margaritha', 'Crosby, Miss. Harriet R',
'Andersson, Miss. Ingeborg Constanzia',
'Andersson, Miss. Sigrid Elisabeth', 'Beane, Mr. Edward',
'Douglas, Mr. Walter Donald', 'Nicholson, Mr. Arthur Ernest',
'Beane, Mrs. Edward (Ethel Clarke)', 'Padro y Manent, Mr. Julian',
'Goldsmith, Mr. Frank John', 'Davies, Master. John Morgan Jr',
'Thayer, Mr. John Borland Jr', 'Sharp, Mr. Percival James R',
"O'Brien, Mr. Timothy", 'Leeni, Mr. Fahim ("Philip Zenni")',
'Ohman, Miss. Velin', 'Wright, Mr. George',
'Duff Gordon, Lady. (Lucille Christiana Sutherland) ("Mrs Morgan")',
'Robbins, Mr. Victor', 'Taussig, Mrs. Emil (Tillie Mandelbaum)',
'de Messemaeker, Mrs. Guillaume Joseph (Emma)',
'Morrow, Mr. Thomas Rowan', 'Sivic, Mr. Husein',
'Norman, Mr. Robert Douglas', 'Simmons, Mr. John',
'Meanwell, Miss. (Marion Ogden)', 'Davies, Mr. Alfred J',
'Stoytcheff, Mr. Ilia',
'Palsson, Mrs. Nils (Alma Cornelia Berglund)',
'Doharr, Mr. Tannous', 'Jonsson, Mr. Carl', 'Harris, Mr. George',
'Appleton, Mrs. Edward Dale (Charlotte Lamson)',
'Flynn, Mr. John Irwin ("Irving")', 'Kelly, Miss. Mary',
'Rush, Mr. Alfred George John', 'Patchett, Mr. George',
'Garside, Miss. Ethel',
'Silvey, Mrs. William Baird (Alice Munger)',
'Caram, Mrs. Joseph (Maria Elias)', 'Jussila, Mr. Eiriik',
'Christy, Miss. Julie Rachel',
'Thayer, Mrs. John Borland (Marian Longstreth Morris)',
'Downton, Mr. William James', 'Ross, Mr. John Hugo',
'Paulner, Mr. Uscher', 'Taussig, Miss. Ruth',
'Jarvis, Mr. John Denzil', 'Frolicher-Stehli, Mr. Maxmillian',
'Gilinski, Mr. Eliezer', 'Murdlin, Mr. Joseph',
'Rintamaki, Mr. Matti',
'Stephenson, Mrs. Walter Bertram (Martha Eustis)',
'Elsbury, Mr. William James', 'Bourke, Miss. Mary',
'Chapman, Mr. John Henry', 'Van Impe, Mr. Jean Baptiste',
'Leitch, Miss. Jessie Wills', 'Johnson, Mr. Alfred',
'Boulos, Mr. Hanna',
'Duff Gordon, Sir. Cosmo Edmund ("Mr Morgan")',
'Jacobsohn, Mrs. Sidney Samuel (Amy Frances Christy)',
'Slabenoff, Mr. Petco', 'Harrington, Mr. Charles H',
'Torber, Mr. Ernst William', 'Homer, Mr. Harry ("Mr E Haven")',
'Lindell, Mr. Edvard Bengtsson', 'Karaic, Mr. Milan',
'Daniel, Mr. Robert Williams',
'Laroche, Mrs. Joseph (Juliette Marie Louise Lafargue)',
'Shutes, Miss. Elizabeth W',
'Andersson, Mrs. Anders Johan (Alfrida Konstantia Brogren)',
'Jardin, Mr. Jose Neto', 'Murphy, Miss. Margaret Jane',
'Horgan, Mr. John', 'Brocklebank, Mr. William Alfred',
'Herman, Miss. Alice', 'Danbom, Mr. Ernst Gilbert',
'Lobb, Mrs. William Arthur (Cordelia K Stanlick)',
'Becker, Miss. Marion Louise', 'Gavey, Mr. Lawrence',
'Yasbeck, Mr. Antoni', 'Kimball, Mr. Edwin Nelson Jr',
'Nakid, Mr. Sahid', 'Hansen, Mr. Henry Damsgaard',
'Bowen, Mr. David John "Dai"', 'Sutton, Mr. Frederick',
'Kirkland, Rev. Charles Leonard', 'Longley, Miss. Gretchen Fiske',
'Bostandyeff, Mr. Guentcho', "O'Connell, Mr. Patrick D",
'Barkworth, Mr. Algernon Henry Wilson',
'Lundahl, Mr. Johan Svensson', 'Stahelin-Maeglin, Dr. Max',
'Parr, Mr. William Henry Marsh', 'Skoog, Miss. Mabel',
'Davis, Miss. Mary', 'Leinonen, Mr. Antti Gustaf',
'Collyer, Mr. Harvey', 'Panula, Mrs. Juha (Maria Emilia Ojala)',
'Thorneycroft, Mr. Percival', 'Jensen, Mr. Hans Peder',
'Sagesser, Mlle. Emma', 'Skoog, Miss. Margit Elizabeth',
'Foo, Mr. Choong', 'Baclini, Miss. Eugenie',
'Harper, Mr. Henry Sleeper', 'Cor, Mr. Liudevit',
'Simonius-Blumer, Col. Oberst Alfons', 'Willey, Mr. Edward',
'Stanley, Miss. Amy Zillah Elsie', 'Mitkoff, Mr. Mito',
'Doling, Miss. Elsie', 'Kalvik, Mr. Johannes Halvorsen',
'O\'Leary, Miss. Hanora "Norah"', 'Hegarty, Miss. Hanora "Nora"',
'Hickman, Mr. Leonard Mark', 'Radeff, Mr. Alexander',
'Bourke, Mrs. John (Catherine)', 'Eitemiller, Mr. George Floyd',
'Newell, Mr. Arthur Webster', 'Frauenthal, Dr. Henry William',
'Badt, Mr. Mohamed', 'Colley, Mr. Edward Pomeroy',
'Coleff, Mr. Peju', 'Lindqvist, Mr. Eino William',
'Hickman, Mr. Lewis', 'Butler, Mr. Reginald Fenton',
'Rommetvedt, Mr. Knud Paust', 'Cook, Mr. Jacob',
'Taylor, Mrs. Elmer Zebley (Juliet Cummins Wright)',
'Brown, Mrs. Thomas William Solomon (Elizabeth Catherine Ford)',
'Davidson, Mr. Thornton', 'Mitchell, Mr. Henry Michael',
'Wilhelms, Mr. Charles', 'Watson, Mr. Ennis Hastings',
'Edvardsson, Mr. Gustaf Hjalmar', 'Sawyer, Mr. Frederick Charles',
'Turja, Miss. Anna Sofia',
'Goodwin, Mrs. Frederick (Augusta Tyler)',
'Cardeza, Mr. Thomas Drake Martinez', 'Peters, Miss. Katie',
'Hassab, Mr. Hammad', 'Olsvigen, Mr. Thor Anderson',
'Goodwin, Mr. Charles Edward', 'Brown, Mr. Thomas William Solomon',
'Laroche, Mr. Joseph Philippe Lemercier',
'Panula, Mr. Jaako Arnold', 'Dakic, Mr. Branko',
'Fischer, Mr. Eberhard Thelander',
'Madill, Miss. Georgette Alexandra', 'Dick, Mr. Albert Adrian',
'Karun, Miss. Manca', 'Lam, Mr. Ali', 'Saad, Mr. Khalil',
'Weir, Col. John', 'Chapman, Mr. Charles Henry',
'Kelly, Mr. James', 'Mullens, Miss. Katherine "Katie"',
'Thayer, Mr. John Borland',
'Humblen, Mr. Adolf Mathias Nicolai Olsen',
'Astor, Mrs. John Jacob (Madeleine Talmadge Force)',
'Silverthorne, Mr. Spencer Victor', 'Barbara, Miss. Saiide',
'Gallagher, Mr. Martin', 'Hansen, Mr. Henrik Juul',
'Morley, Mr. Henry Samuel ("Mr Henry Marshall")',
'Kelly, Mrs. Florence "Fannie"',
'Calderhead, Mr. Edward Pennington', 'Cleaver, Miss. Alice',
'Moubarek, Master. Halim Gonios ("William George")',
'Mayne, Mlle. Berthe Antonine ("Mrs de Villiers")',
'Klaber, Mr. Herman', 'Taylor, Mr. Elmer Zebley',
'Larsson, Mr. August Viktor', 'Greenberg, Mr. Samuel',
'Soholt, Mr. Peter Andreas Lauritz Andersen',
'Endres, Miss. Caroline Louise',
'Troutt, Miss. Edwina Celia "Winnie"', 'McEvoy, Mr. Michael',
'Johnson, Mr. Malkolm Joackim',
'Harper, Miss. Annie Jessie "Nina"', 'Jensen, Mr. Svend Lauritz',
'Gillespie, Mr. William Henry', 'Hodges, Mr. Henry Price',
'Chambers, Mr. Norman Campbell', 'Oreskovic, Mr. Luka',
'Renouf, Mrs. Peter Henry (Lillian Jefferys)',
'Mannion, Miss. Margareth', 'Bryhl, Mr. Kurt Arnold Gottfrid',
'Ilmakangas, Miss. Pieta Sofia', 'Allen, Miss. Elisabeth Walton',
'Hassan, Mr. Houssein G N', 'Knight, Mr. Robert J',
'Berriman, Mr. William John', 'Troupiansky, Mr. Moses Aaron',
'Williams, Mr. Leslie', 'Ford, Mrs. Edward (Margaret Ann Watson)',
'Lesurer, Mr. Gustave J', 'Ivanoff, Mr. Kanio',
'Nankoff, Mr. Minko', 'Hawksford, Mr. Walter James',
'Cavendish, Mr. Tyrell William',
'Ryerson, Miss. Susan Parker "Suzette"', 'McNamee, Mr. Neal',
'Stranden, Mr. Juho', 'Crosby, Capt. Edward Gifford',
'Abbott, Mr. Rossmore Edward', 'Sinkkonen, Miss. Anna',
'Marvin, Mr. Daniel Warner', 'Connaghton, Mr. Michael',
'Wells, Miss. Joan', 'Moor, Master. Meier',
'Vande Velde, Mr. Johannes Joseph', 'Jonkoff, Mr. Lalio',
'Herman, Mrs. Samuel (Jane Laver)', 'Hamalainen, Master. Viljo',
'Carlsson, Mr. August Sigfrid', 'Bailey, Mr. Percy Andrew',
'Theobald, Mr. Thomas Leonard',
'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
'Garfirth, Mr. John', 'Nirva, Mr. Iisakki Antino Aijo',
'Barah, Mr. Hanna Assi',
'Carter, Mrs. William Ernest (Lucile Polk)',
'Eklund, Mr. Hans Linus', 'Hogeboom, Mrs. John C (Anna Andrews)',
'Brewe, Dr. Arthur Jackson', 'Mangan, Miss. Mary',
'Moran, Mr. Daniel J', 'Gronnestad, Mr. Daniel Danielsen',
'Lievens, Mr. Rene Aime', 'Jensen, Mr. Niels Peder',
'Mack, Mrs. (Mary)', 'Elias, Mr. Dibo',
'Hocking, Mrs. Elizabeth (Eliza Needs)',
'Myhrman, Mr. Pehr Fabian Oliver Malkolm', 'Tobin, Mr. Roger',
'Emanuel, Miss. Virginia Ethel', 'Kilgannon, Mr. Thomas J',
'Robert, Mrs. Edward Scott (Elisabeth Walton McMillan)',
'Ayoub, Miss. Banoura',
'Dick, Mrs. Albert Adrian (Vera Gillespie)',
'Long, Mr. Milton Clyde', 'Johnston, Mr. Andrew G',
'Ali, Mr. William', 'Harmer, Mr. Abraham (David Lishin)',
'Sjoblom, Miss. Anna Sofia', 'Rice, Master. George Hugh',
'Dean, Master. Bertram Vere', 'Guggenheim, Mr. Benjamin',
'Keane, Mr. Andrew "Andy"', 'Gaskell, Mr. Alfred',
'Sage, Miss. Stella Anna', 'Hoyt, Mr. William Fisher',
'Dantcheff, Mr. Ristiu', 'Otter, Mr. Richard',
'Leader, Dr. Alice (Farnham)', 'Osman, Mrs. Mara',
'Ibrahim Shawah, Mr. Yousseff',
'Van Impe, Mrs. Jean Baptiste (Rosalie Paula Govaert)',
'Ponesell, Mr. Martin',
'Collyer, Mrs. Harvey (Charlotte Annie Tate)',
'Carter, Master. William Thornton II',
'Thomas, Master. Assad Alexander', 'Hedman, Mr. Oskar Arvid',
'Johansson, Mr. Karl Johan', 'Andrews, Mr. Thomas Jr',
'Pettersson, Miss. Ellen Natalia', 'Meyer, Mr. August',
'Chambers, Mrs. Norman Campbell (Bertha Griggs)',
'Alexander, Mr. William', 'Lester, Mr. James',
'Slemen, Mr. Richard James', 'Andersson, Miss. Ebba Iris Alfrida',
'Tomlin, Mr. Ernest Portage', 'Fry, Mr. Richard',
'Heininen, Miss. Wendla Maria', 'Mallet, Mr. Albert',
'Holm, Mr. John Fredrik Alexander', 'Skoog, Master. Karl Thorsten',
'Hays, Mrs. Charles Melville (Clara Jennings Gregg)',
'Lulic, Mr. Nikola', 'Reuchlin, Jonkheer. John George',
'Moor, Mrs. (Beila)', 'Panula, Master. Urho Abraham',
'Flynn, Mr. John', 'Lam, Mr. Len', 'Mallet, Master. Andre',
'McCormack, Mr. Thomas Joseph',
'Stone, Mrs. George Nelson (Martha Evelyn)',
'Yasbeck, Mrs. Antoni (Selini Alexander)',
'Richards, Master. George Sibley', 'Saad, Mr. Amin',
'Augustsson, Mr. Albert', 'Allum, Mr. Owen George',
'Compton, Miss. Sara Rebecca', 'Pasic, Mr. Jakob',
'Sirota, Mr. Maurice', 'Chip, Mr. Chang', 'Marechal, Mr. Pierre',
'Alhomaki, Mr. Ilmari Rudolf', 'Mudd, Mr. Thomas Charles',
'Serepeca, Miss. Augusta', 'Lemberopolous, Mr. Peter L',
'Culumovic, Mr. Jeso', 'Abbing, Mr. Anthony',
'Sage, Mr. Douglas Bullen', 'Markoff, Mr. Marin',
'Harper, Rev. John',
'Goldenberg, Mrs. Samuel L (Edwiga Grabowska)',
'Andersson, Master. Sigvard Harald Elias', 'Svensson, Mr. Johan',
'Boulos, Miss. Nourelain', 'Lines, Miss. Mary Conover',
'Carter, Mrs. Ernest Courtenay (Lilian Hughes)',
'Aks, Mrs. Sam (Leah Rosen)',
'Wick, Mrs. George Dennick (Mary Hitchcock)',
'Daly, Mr. Peter Denis ', 'Baclini, Mrs. Solomon (Latifa Qurban)',
'Razi, Mr. Raihed', 'Hansen, Mr. Claus Peter',
'Giles, Mr. Frederick Edward',
'Swift, Mrs. Frederick Joel (Margaret Welles Barron)',
'Sage, Miss. Dorothy Edith "Dolly"', 'Gill, Mr. John William',
'Bystrom, Mrs. (Karolina)', 'Duran y More, Miss. Asuncion',
'Roebling, Mr. Washington Augustus II',
'van Melkebeke, Mr. Philemon', 'Johnson, Master. Harold Theodor',
'Balkic, Mr. Cerin',
'Beckwith, Mrs. Richard Leonard (Sallie Monypeny)',
'Carlsson, Mr. Frans Olof', 'Vander Cruyssen, Mr. Victor',
'Abelson, Mrs. Samuel (Hannah Wizosky)',
'Najib, Miss. Adele Kiamie "Jane"',
'Gustafsson, Mr. Alfred Ossian', 'Petroff, Mr. Nedelio',
'Laleff, Mr. Kristo',
'Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)',
'Shelley, Mrs. William (Imanita Parrish Hall)',
'Markun, Mr. Johann', 'Dahlberg, Miss. Gerda Ulrika',
'Banfield, Mr. Frederick James', 'Sutehall, Mr. Henry Jr',
'Rice, Mrs. William (Margaret Norton)', 'Montvila, Rev. Juozas',
'Graham, Miss. Margaret Edith',
'Johnston, Miss. Catherine Helen "Carrie"',
'Behr, Mr. Karl Howell', 'Dooley, Mr. Patrick'], dtype=object)
df['Embarked'].unique()
array(['S', 'C', 'Q', nan], dtype=object)
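`unique()` shows that `Embarked` contains `nan`, but not how many rows are affected. The sketch below uses a small hypothetical Series (not the Titanic data) to show how `value_counts(dropna=False)` counts missing entries alongside the real categories. Filling with the mode is only one possible choice and is an assumption here, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a categorical column such as df['Embarked']
embarked = pd.Series(['S', 'C', 'Q', np.nan, 'S', 'S', np.nan])

# dropna=False keeps NaN as its own bucket in the counts
counts = embarked.value_counts(dropna=False)
print(counts)

# Number of missing entries
n_missing = int(embarked.isna().sum())

# One common option: fill missing entries with the most frequent value.
# mode() ignores NaN by default and returns a Series, hence the [0].
filled = embarked.fillna(embarked.mode()[0])
print(filled.tolist())
```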
df['Survived'].unique()
array([0, 1])
df['Pclass']
PassengerId
1 3
2 1
3 3
4 1
5 3
6 3
7 1
8 3
9 3
10 2
11 3
12 1
13 3
14 3
15 3
16 2
17 3
18 2
19 3
20 3
21 2
22 2
23 3
24 1
25 3
26 3
27 3
28 1
29 3
30 3
..
862 2
863 1
864 3
865 2
866 2
867 2
868 1
869 3
870 3
871 3
872 1
873 1
874 3
875 2
876 3
877 3
878 3
879 3
880 1
881 2
882 3
883 3
884 2
885 3
886 3
887 2
888 1
889 3
890 1
891 3
Name: Pclass, Length: 891, dtype: int64
df['Pclass'].unique()
array([3, 1, 2])
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
atheletes_df = pd.read_csv('athlete_events.csv')
regions_df = pd.read_csv('noc_regions.csv')
data_df = pd.merge(atheletes_df, regions_df, on='NOC', how='left')
data_df.head()
| | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal | region | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | NaN | China | NaN |
| 1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | NaN | China | NaN |
| 2 | 3 | Gunnar Nielsen Aaby | M | 24.0 | NaN | NaN | Denmark | DEN | 1920 Summer | 1920 | Summer | Antwerpen | Football | Football Men's Football | NaN | Denmark | NaN |
| 3 | 4 | Edgar Lindenau Aabye | M | 34.0 | NaN | NaN | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold | Denmark | NaN |
| 4 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 500 metres | NaN | Netherlands | NaN |
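A left merge keeps every row of the left frame, so any `NOC` code without a match in the regions table ends up with `NaN` in `region`. The sketch below illustrates this on two tiny hypothetical frames; the code `'XXX'` and the variable names are made up for illustration. The `indicator=True` option adds a `_merge` column that makes the unmatched rows easy to find.

```python
import pandas as pd

# Hypothetical miniature versions of the athletes and regions tables
athletes = pd.DataFrame({'Name': ['A', 'B', 'C'],
                         'NOC': ['CHN', 'DEN', 'XXX']})
regions = pd.DataFrame({'NOC': ['CHN', 'DEN'],
                        'region': ['China', 'Denmark']})

# how='left' keeps all athletes; indicator=True records where each row came from
merged = pd.merge(athletes, regions, on='NOC', how='left', indicator=True)

# Rows present only in the left frame have no region
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['Name', 'NOC', 'region']])
```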
data_df.columns
Index(['ID', 'Name', 'Sex', 'Age', 'Height', 'Weight', 'Team', 'NOC', 'Games',
'Year', 'Season', 'City', 'Sport', 'Event', 'Medal', 'region', 'notes'],
dtype='object')
data_df.describe()
| | ID | Age | Height | Weight | Year |
|---|---|---|---|---|---|
| count | 271116.000000 | 261642.000000 | 210945.000000 | 208241.000000 | 271116.000000 |
| mean | 68248.954396 | 25.556898 | 175.338970 | 70.702393 | 1978.378480 |
| std | 39022.286345 | 6.393561 | 10.518462 | 14.348020 | 29.877632 |
| min | 1.000000 | 10.000000 | 127.000000 | 25.000000 | 1896.000000 |
| 25% | 34643.000000 | 21.000000 | 168.000000 | 60.000000 | 1960.000000 |
| 50% | 68205.000000 | 24.000000 | 175.000000 | 70.000000 | 1988.000000 |
| 75% | 102097.250000 | 28.000000 | 183.000000 | 79.000000 | 2002.000000 |
| max | 135571.000000 | 97.000000 | 226.000000 | 214.000000 | 2016.000000 |
data_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 271116 entries, 0 to 271115
Data columns (total 17 columns):
ID 271116 non-null int64
Name 271116 non-null object
Sex 271116 non-null object
Age 261642 non-null float64
Height 210945 non-null float64
Weight 208241 non-null float64
Team 271116 non-null object
NOC 271116 non-null object
Games 271116 non-null object
Year 271116 non-null int64
Season 271116 non-null object
City 271116 non-null object
Sport 271116 non-null object
Event 271116 non-null object
Medal 39783 non-null object
region 270746 non-null object
notes 5039 non-null object
dtypes: float64(3), int64(2), object(12)
memory usage: 37.2+ MB
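`info()` reports non-null counts, so the number of missing values per column has to be worked out by subtraction. A minimal sketch, on a small hypothetical frame rather than the Olympic data, of getting the missing counts and shares directly with `isna()`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few gaps
toy = pd.DataFrame({
    'Age': [24.0, np.nan, 34.0, 21.0],
    'Height': [180.0, np.nan, np.nan, 185.0],
    'Medal': [np.nan, np.nan, 'Gold', np.nan],
})

# isna() gives a boolean frame; sum() counts True values per column
missing_count = toy.isna().sum()

# mean() of booleans is the fraction of missing values per column
missing_share = toy.isna().mean()

# Combine both, with the most incomplete column first
summary = pd.DataFrame({'missing': missing_count, 'share': missing_share})
summary = summary.sort_values('missing', ascending=False)
print(summary)
```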
sns.distplot(data_df['Age'])
/home/kashif/anaconda3/lib/python3.7/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
/home/kashif/anaconda3/lib/python3.7/site-packages/numpy/core/_methods.py:28: RuntimeWarning: invalid value encountered in reduce
return umr_maximum(a, axis, None, out, keepdims, initial)
/home/kashif/anaconda3/lib/python3.7/site-packages/numpy/core/_methods.py:32: RuntimeWarning: invalid value encountered in reduce
return umr_minimum(a, axis, None, out, keepdims, initial)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-8-8331d431a2ca> in <module>
----> 1 sns.distplot(data_df['Age'])
~/anaconda3/lib/python3.7/site-packages/seaborn/distributions.py in distplot(a, bins, hist, kde, rug, fit, hist_kws, kde_kws, rug_kws, fit_kws, color, vertical, norm_hist, axlabel, label, ax)
213 if hist:
214 if bins is None:
--> 215 bins = min(_freedman_diaconis_bins(a), 50)
216 hist_kws.setdefault("alpha", 0.4)
217 if LooseVersion(mpl.__version__) < LooseVersion("2.2"):
~/anaconda3/lib/python3.7/site-packages/seaborn/distributions.py in _freedman_diaconis_bins(a)
37 return int(np.sqrt(a.size))
38 else:
---> 39 return int(np.ceil((a.max() - a.min()) / h))
40
41
ValueError: cannot convert float NaN to integer
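The traceback suggests the failure comes from the missing ages: the bin count is derived from `a.max() - a.min()`, and a NumPy reduction over an array containing `NaN` returns `NaN`, which cannot be converted to an integer. The sketch below reproduces that mechanism on a small hypothetical Series, without calling seaborn, and shows that dropping the missing values first avoids it.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value
ages = pd.Series([24.0, 23.0, np.nan, 34.0])

# NumPy's max/min propagate NaN (pandas' own max/min would skip it)
arr = ages.to_numpy()
value_range = arr.max() - arr.min()
print(np.isnan(value_range))

# Converting NaN to int raises the same ValueError seen in the traceback
try:
    int(value_range)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# After dropping missing values the range is an ordinary number
clean = ages.dropna()
clean_range = int(clean.max() - clean.min())
print(clean_range)
```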
age_df = pd.to_numeric(data_df['Age'], errors='coerce')  # non-numeric entries become NaN
age_df = age_df.dropna()  # drop rows with a missing age
age_df = age_df.astype(int)  # safe now that no NaN remains
age_df
0 24
1 23
2 24
3 34
4 21
5 21
6 25
7 25
8 27
9 27
10 31
11 31
12 31
13 31
14 33
15 33
16 33
17 33
18 31
19 31
20 31
21 31
22 33
23 33
24 33
25 33
26 18
27 18
28 26
29 26
..
271086 23
271087 19
271088 19
271089 34
271090 38
271091 32
271092 21
271093 21
271094 25
271095 25
271096 29
271097 29
271098 33
271099 36
271100 26
271101 24
271102 19
271103 23
271104 22
271105 23
271106 27
271107 21
271108 24
271109 28
271110 33
271111 29
271112 27
271113 27
271114 30
271115 34
Name: Age, Length: 261642, dtype: int64
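Dropping the missing ages shortens the Series from 271116 to 261642 entries, so it no longer lines up row for row with `data_df`. If the rows need to be kept, one alternative is the nullable integer dtype. This is a sketch on hypothetical data and assumes a pandas version that supports `'Int64'` (capital I); it is not what the notebook itself does.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value
raw = pd.Series([24.0, np.nan, 34.0, 21.0], name='Age')

# 'Int64' stores integers and keeps missing entries as <NA>
nullable = pd.to_numeric(raw, errors='coerce').astype('Int64')
print(nullable)

# Length is unchanged, and the missing entry is still detectable
print(len(nullable), int(nullable.isna().sum()))
```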
sns.countplot(x=age_df)
<matplotlib.axes._subplots.AxesSubplot at 0x7f28ef021470>
data_df.loc[data_df['Medal'].isnull()]
medalists_df = data_df.loc[~data_df['Medal'].isnull()]
medalists_df
| | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal | region | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 4 | Edgar Lindenau Aabye | M | 34.0 | NaN | NaN | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold | Denmark | NaN |
| 37 | 15 | Arvo Ossian Aaltonen | M | 30.0 | NaN | NaN | Finland | FIN | 1920 Summer | 1920 | Summer | Antwerpen | Swimming | Swimming Men's 200 metres Breaststroke | Bronze | Finland | NaN |
| 38 | 15 | Arvo Ossian Aaltonen | M | 30.0 | NaN | NaN | Finland | FIN | 1920 Summer | 1920 | Summer | Antwerpen | Swimming | Swimming Men's 400 metres Breaststroke | Bronze | Finland | NaN |
| 40 | 16 | Juhamatti Tapio Aaltonen | M | 28.0 | 184.0 | 85.0 | Finland | FIN | 2014 Winter | 2014 | Winter | Sochi | Ice Hockey | Ice Hockey Men's Ice Hockey | Bronze | Finland | NaN |
| 41 | 17 | Paavo Johannes Aaltonen | M | 28.0 | 175.0 | 64.0 | Finland | FIN | 1948 Summer | 1948 | Summer | London | Gymnastics | Gymnastics Men's Individual All-Around | Bronze | Finland | NaN |
| 42 | 17 | Paavo Johannes Aaltonen | M | 28.0 | 175.0 | 64.0 | Finland | FIN | 1948 Summer | 1948 | Summer | London | Gymnastics | Gymnastics Men's Team All-Around | Gold | Finland | NaN |
| 44 | 17 | Paavo Johannes Aaltonen | M | 28.0 | 175.0 | 64.0 | Finland | FIN | 1948 Summer | 1948 | Summer | London | Gymnastics | Gymnastics Men's Horse Vault | Gold | Finland | NaN |
| 48 | 17 | Paavo Johannes Aaltonen | M | 28.0 | 175.0 | 64.0 | Finland | FIN | 1948 Summer | 1948 | Summer | London | Gymnastics | Gymnastics Men's Pommelled Horse | Gold | Finland | NaN |
| 50 | 17 | Paavo Johannes Aaltonen | M | 32.0 | 175.0 | 64.0 | Finland | FIN | 1952 Summer | 1952 | Summer | Helsinki | Gymnastics | Gymnastics Men's Team All-Around | Bronze | Finland | NaN |
| 60 | 20 | Kjetil Andr Aamodt | M | 20.0 | 176.0 | 85.0 | Norway | NOR | 1992 Winter | 1992 | Winter | Albertville | Alpine Skiing | Alpine Skiing Men's Super G | Gold | Norway | NaN |
| 61 | 20 | Kjetil Andr Aamodt | M | 20.0 | 176.0 | 85.0 | Norway | NOR | 1992 Winter | 1992 | Winter | Albertville | Alpine Skiing | Alpine Skiing Men's Giant Slalom | Bronze | Norway | NaN |
| 63 | 20 | Kjetil Andr Aamodt | M | 22.0 | 176.0 | 85.0 | Norway | NOR | 1994 Winter | 1994 | Winter | Lillehammer | Alpine Skiing | Alpine Skiing Men's Downhill | Silver | Norway | NaN |
| 64 | 20 | Kjetil Andr Aamodt | M | 22.0 | 176.0 | 85.0 | Norway | NOR | 1994 Winter | 1994 | Winter | Lillehammer | Alpine Skiing | Alpine Skiing Men's Super G | Bronze | Norway | NaN |
| 67 | 20 | Kjetil Andr Aamodt | M | 22.0 | 176.0 | 85.0 | Norway | NOR | 1994 Winter | 1994 | Winter | Lillehammer | Alpine Skiing | Alpine Skiing Men's Combined | Silver | Norway | NaN |
| 73 | 20 | Kjetil Andr Aamodt | M | 30.0 | 176.0 | 85.0 | Norway | NOR | 2002 Winter | 2002 | Winter | Salt Lake City | Alpine Skiing | Alpine Skiing Men's Super G | Gold | Norway | NaN |
| 76 | 20 | Kjetil Andr Aamodt | M | 30.0 | 176.0 | 85.0 | Norway | NOR | 2002 Winter | 2002 | Winter | Salt Lake City | Alpine Skiing | Alpine Skiing Men's Combined | Gold | Norway | NaN |
| 78 | 20 | Kjetil Andr Aamodt | M | 34.0 | 176.0 | 85.0 | Norway | NOR | 2006 Winter | 2006 | Winter | Torino | Alpine Skiing | Alpine Skiing Men's Super G | Gold | Norway | NaN |
| 79 | 21 | Ragnhild Margrethe Aamodt | F | 27.0 | 163.0 | NaN | Norway | NOR | 2008 Summer | 2008 | Summer | Beijing | Handball | Handball Women's Handball | Gold | Norway | NaN |
| 86 | 25 | Alf Lied Aanning | M | 24.0 | NaN | NaN | Norway | NOR | 1920 Summer | 1920 | Summer | Antwerpen | Gymnastics | Gymnastics Men's Team All-Around, Free System | Silver | Norway | NaN |
| 91 | 29 | Willemien Aardenburg | F | 22.0 | NaN | NaN | Netherlands | NED | 1988 Summer | 1988 | Summer | Seoul | Hockey | Hockey Women's Hockey | Bronze | Netherlands | NaN |
| 92 | 30 | Pepijn Aardewijn | M | 26.0 | 189.0 | 72.0 | Netherlands | NED | 1996 Summer | 1996 | Summer | Atlanta | Rowing | Rowing Men's Lightweight Double Sculls | Silver | Netherlands | NaN |
| 105 | 37 | Ann Kristin Aarnes | F | 23.0 | 182.0 | 64.0 | Norway | NOR | 1996 Summer | 1996 | Summer | Atlanta | Football | Football Women's Football | Bronze | Norway | NaN |
| 106 | 38 | Karl Jan Aas | M | 20.0 | NaN | NaN | Norway | NOR | 1920 Summer | 1920 | Summer | Antwerpen | Gymnastics | Gymnastics Men's Team All-Around, Free System | Silver | Norway | NaN |
| 110 | 40 | Roald Edgar Aas | M | 23.0 | NaN | NaN | Norway | NOR | 1952 Winter | 1952 | Winter | Oslo | Speed Skating | Speed Skating Men's 1,500 metres | Bronze | Norway | NaN |
| 113 | 40 | Roald Edgar Aas | M | 31.0 | NaN | NaN | Norway | NOR | 1960 Winter | 1960 | Winter | Squaw Valley | Speed Skating | Speed Skating Men's 1,500 metres | Gold | Norway | NaN |
| 117 | 42 | Thomas Valentin Aas | M | 25.0 | NaN | NaN | Taifun | NOR | 1912 Summer | 1912 | Summer | Stockholm | Sailing | Sailing Mixed 8 metres | Gold | Norway | NaN |
| 150 | 56 | Ren Abadie | M | 21.0 | NaN | NaN | France | FRA | 1956 Summer | 1956 | Summer | Melbourne | Cycling | Cycling Men's Road Race, Team | Gold | France | NaN |
| 158 | 62 | Giovanni Abagnale | M | 21.0 | 198.0 | 90.0 | Italy | ITA | 2016 Summer | 2016 | Summer | Rio de Janeiro | Rowing | Rowing Men's Coxless Pairs | Bronze | Italy | NaN |
| 159 | 63 | Jos Luis Abajo Gmez | M | 30.0 | 194.0 | 87.0 | Spain | ESP | 2008 Summer | 2008 | Summer | Beijing | Fencing | Fencing Men's epee, Individual | Bronze | Spain | NaN |
| 161 | 65 | Patimat Abakarova | F | 21.0 | 165.0 | 49.0 | Azerbaijan | AZE | 2016 Summer | 2016 | Summer | Rio de Janeiro | Taekwondo | Taekwondo Women's Flyweight | Bronze | Azerbaijan | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 270914 | 135481 | Jules Alexis "Louis" Zutter | M | 30.0 | NaN | NaN | Switzerland | SUI | 1896 Summer | 1896 | Summer | Athina | Gymnastics | Gymnastics Men's Horse Vault | Silver | Switzerland | NaN |
| 270915 | 135481 | Jules Alexis "Louis" Zutter | M | 30.0 | NaN | NaN | Switzerland | SUI | 1896 Summer | 1896 | Summer | Athina | Gymnastics | Gymnastics Men's Parallel Bars | Silver | Switzerland | NaN |
| 270917 | 135481 | Jules Alexis "Louis" Zutter | M | 30.0 | NaN | NaN | Switzerland | SUI | 1896 Summer | 1896 | Summer | Athina | Gymnastics | Gymnastics Men's Pommelled Horse | Gold | Switzerland | NaN |
| 270931 | 135486 | Viktor Valeryevich Zuyev | M | 21.0 | 188.0 | 91.0 | Belarus | BLR | 2004 Summer | 2004 | Summer | Athina | Boxing | Boxing Men's Heavyweight | Silver | Belarus | NaN |
| 270934 | 135488 | Nataliya Vladimirovna Zuyeva | F | 19.0 | 176.0 | 62.0 | Russia | RUS | 2008 Summer | 2008 | Summer | Beijing | Rhythmic Gymnastics | Rhythmic Gymnastics Women's Group | Gold | Russia | NaN |
| 270939 | 135489 | Anastasiya Valeryevna Zuyeva-Fesikova | F | 22.0 | 182.0 | 71.0 | Russia | RUS | 2012 Summer | 2012 | Summer | London | Swimming | Swimming Women's 200 metres Backstroke | Silver | Russia | NaN |
| 270946 | 135491 | Marijan uej | M | 22.0 | 186.0 | 93.0 | Yugoslavia | YUG | 1956 Summer | 1956 | Summer | Melbourne | Water Polo | Water Polo Men's Water Polo | Silver | Serbia | Yugoslavia |
| 270960 | 135498 | Denis vegelj | M | 20.0 | NaN | NaN | Slovenia | SLO | 1992 Summer | 1992 | Summer | Barcelona | Rowing | Rowing Men's Coxless Pairs | Bronze | Slovenia | NaN |
| 270969 | 135501 | Ellina Aleksandrovna Zvereva (Kisheyeva-) | F | 35.0 | 183.0 | 100.0 | Belarus | BLR | 1996 Summer | 1996 | Summer | Atlanta | Athletics | Athletics Women's Discus Throw | Bronze | Belarus | NaN |
| 270970 | 135501 | Ellina Aleksandrovna Zvereva (Kisheyeva-) | F | 39.0 | 183.0 | 100.0 | Belarus | BLR | 2000 Summer | 2000 | Summer | Sydney | Athletics | Athletics Women's Discus Throw | Gold | Belarus | NaN |
| 270976 | 135502 | Nataliya Maratovna "Natasha" Zvereva | F | 21.0 | 172.0 | 60.0 | Unified Team | EUN | 1992 Summer | 1992 | Summer | Barcelona | Tennis | Tennis Women's Doubles | Bronze | Russia | NaN |
| 270981 | 135503 | Zurab Zviadauri | M | 23.0 | 182.0 | 90.0 | Georgia | GEO | 2004 Summer | 2004 | Summer | Athina | Judo | Judo Men's Middleweight | Gold | Georgia | NaN |
| 270982 | 135504 | Viktor Oleksandrovych Zviahintsev | M | 25.0 | 178.0 | 79.0 | Soviet Union | URS | 1976 Summer | 1976 | Summer | Montreal | Football | Football Men's Football | Bronze | Russia | NaN |
| 270986 | 135508 | Vera Igorevna Zvonaryova | F | 23.0 | 172.0 | 59.0 | Russia | RUS | 2008 Summer | 2008 | Summer | Beijing | Tennis | Tennis Women's Singles | Bronze | Russia | NaN |
| 271009 | 135520 | Julia Zwehl | F | 28.0 | 167.0 | 60.0 | Germany | GER | 2004 Summer | 2004 | Summer | Athina | Hockey | Hockey Women's Hockey | Gold | Germany | NaN |
| 271010 | 135521 | Anton Zwerina | M | 23.0 | NaN | 66.0 | Austria | AUT | 1924 Summer | 1924 | Summer | Paris | Weightlifting | Weightlifting Men's Lightweight | Silver | Austria | NaN |
| 271013 | 135522 | Klaas Erik "Klaas-Erik" Zwering | M | 23.0 | 189.0 | 80.0 | Netherlands | NED | 2004 Summer | 2004 | Summer | Athina | Swimming | Swimming Men's 4 x 100 metres Freestyle Relay | Silver | Netherlands | NaN |
| 271015 | 135523 | Ronald Ferdinand "Ron" Zwerver | M | 25.0 | 200.0 | 93.0 | Netherlands | NED | 1992 Summer | 1992 | Summer | Barcelona | Volleyball | Volleyball Men's Volleyball | Silver | Netherlands | NaN |
| 271016 | 135523 | Ronald Ferdinand "Ron" Zwerver | M | 29.0 | 200.0 | 93.0 | Netherlands | NED | 1996 Summer | 1996 | Summer | Atlanta | Volleyball | Volleyball Men's Volleyball | Gold | Netherlands | NaN |
| 271019 | 135525 | Martin Zwicker | M | 29.0 | 175.0 | 64.0 | Germany | GER | 2016 Summer | 2016 | Summer | Rio de Janeiro | Hockey | Hockey Men's Hockey | Bronze | Germany | NaN |
| 271032 | 135535 | Claudia Antoinette Zwiers | F | 22.0 | 181.0 | 78.0 | Netherlands | NED | 1996 Summer | 1996 | Summer | Atlanta | Judo | Judo Women's Middleweight | Bronze | Netherlands | NaN |
| 271046 | 135544 | Krzysztof Zwoliski | M | 21.0 | 175.0 | 70.0 | Poland | POL | 1980 Summer | 1980 | Summer | Moskva | Athletics | Athletics Men's 4 x 100 metres Relay | Silver | Poland | NaN |
| 271048 | 135545 | Henk Jan Zwolle | M | 27.0 | 197.0 | 93.0 | Netherlands | NED | 1992 Summer | 1992 | Summer | Barcelona | Rowing | Rowing Men's Double Sculls | Bronze | Netherlands | NaN |
| 271049 | 135545 | Henk Jan Zwolle | M | 31.0 | 197.0 | 93.0 | Netherlands | NED | 1996 Summer | 1996 | Summer | Atlanta | Rowing | Rowing Men's Coxed Eights | Gold | Netherlands | NaN |
| 271076 | 135553 | Galina Ivanovna Zybina (-Fyodorova) | F | 21.0 | 168.0 | 80.0 | Soviet Union | URS | 1952 Summer | 1952 | Summer | Helsinki | Athletics | Athletics Women's Shot Put | Gold | Russia | NaN |
| 271078 | 135553 | Galina Ivanovna Zybina (-Fyodorova) | F | 25.0 | 168.0 | 80.0 | Soviet Union | URS | 1956 Summer | 1956 | Summer | Melbourne | Athletics | Athletics Women's Shot Put | Silver | Russia | NaN |
| 271080 | 135553 | Galina Ivanovna Zybina (-Fyodorova) | F | 33.0 | 168.0 | 80.0 | Soviet Union | URS | 1964 Summer | 1964 | Summer | Tokyo | Athletics | Athletics Women's Shot Put | Bronze | Russia | NaN |
| 271082 | 135554 | Bogusaw Zych | M | 28.0 | 182.0 | 82.0 | Poland | POL | 1980 Summer | 1980 | Summer | Moskva | Fencing | Fencing Men's Foil, Team | Bronze | Poland | NaN |
| 271102 | 135563 | Olesya Nikolayevna Zykina | F | 19.0 | 171.0 | 64.0 | Russia | RUS | 2000 Summer | 2000 | Summer | Sydney | Athletics | Athletics Women's 4 x 400 metres Relay | Bronze | Russia | NaN |
| 271103 | 135563 | Olesya Nikolayevna Zykina | F | 23.0 | 171.0 | 64.0 | Russia | RUS | 2004 Summer | 2004 | Summer | Athina | Athletics | Athletics Women's 4 x 400 metres Relay | Silver | Russia | NaN |
39783 rows × 17 columns
def plot_column(my_df, col, chart_type='Histogram', dtype=int,
                bin_size=25):
    # Coerce the column to numeric, drop missing/unparseable values, then cast
    temp_df = pd.to_numeric(my_df[col], errors='coerce')
    temp_df = temp_df.dropna()
    temp_df = temp_df.astype(dtype)
    if chart_type == 'Histogram':
        ax = sns.countplot(x=temp_df)
    elif chart_type == 'Density':
        ax = sns.distplot(temp_df)
    # Place bin_size evenly spaced ticks across the x-axis
    xmin, xmax = ax.get_xlim()
    ax.set_xticks(np.round(np.linspace(xmin, xmax, bin_size), 2))
    plt.tight_layout()
    plt.locator_params(axis='y', nbins=6)
    plt.show()
plot_column(medalists_df, 'Age')
from scipy.stats import skew
age_df = pd.to_numeric(medalists_df['Age'], errors='coerce')
age_df = age_df.dropna()
age_df = age_df.astype(int)
print("Skewness is {}".format(skew(age_df)))
print("Mean is {}".format(np.mean(age_df)))
print("Median is {}".format(np.median(age_df)))
Skewness is 1.497531959387686
Mean is 25.925174771452717
Median is 25.0
plot_column(medalists_df, 'Height', bin_size=15)
Height_df = pd.to_numeric(medalists_df['Height'], errors='coerce')
Height_df = Height_df.dropna()
Height_df = Height_df.astype(int)
print("Skewness is {}".format(skew(Height_df)))
print("Mean is {}".format(np.mean(Height_df)))
print("Median is {}".format(np.median(Height_df)))
Skewness is 0.046825246979302765
Mean is 177.55419670442842
Median is 178.0
plot_column(medalists_df, 'Weight', bin_size=15)
Weight_df = pd.to_numeric(medalists_df['Weight'], errors='coerce')
Weight_df = Weight_df.dropna()
Weight_df = Weight_df.astype(int)
print("Skewness is {}".format(skew(Weight_df)))
print("Mean is {}".format(np.mean(Weight_df)))
print("Median is {}".format(np.median(Weight_df)))
Skewness is 0.6921002780813605
Mean is 73.76723798266352
Median is 73.0
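To make the skewness figures above easier to read, here is a minimal sketch of what `scipy.stats.skew` computes with its default settings: the third central moment divided by the second central moment raised to the power 1.5. The helper name `sample_skewness` and the `toy_ages` values are made up for illustration and are not part of the Olympic data.

```python
import numpy as np

def sample_skewness(values):
    """Biased sample skewness: m3 / m2**1.5, using population moments."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # second central moment (variance)
    m3 = np.mean(centered ** 3)  # third central moment
    return m3 / m2 ** 1.5

# Made-up ages with a long right tail, loosely mimicking the Age column
toy_ages = np.array([19, 21, 22, 23, 24, 25, 27, 30, 41, 58])

print("Skewness is {}".format(sample_skewness(toy_ages)))
print("Mean is {}".format(toy_ages.mean()))
print("Median is {}".format(np.median(toy_ages)))
```

A long right tail pulls the mean above the median and gives a positive skewness, which is the pattern in the Age output; the Height output, with skewness close to zero, has a mean and median that almost coincide.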
sports_df = medalists_df[~medalists_df['Sport'].isnull()]
sns.countplot(x='Sport', data=sports_df)
<matplotlib.axes._subplots.AxesSubplot at 0x7f28e62802b0>
sum(medalists_df['Sport'].isnull())
sports_count=medalists_df['Sport'].value_counts().nlargest(25).to_frame()
#sports_count.reset_index(inplace=True)
print(sports_count)
                      Sport
Athletics              3969
Swimming               3048
Rowing                 2945
Gymnastics             2256
Fencing                1743
Football               1571
Ice Hockey             1530
Hockey                 1528
Wrestling              1296
Cycling                1263
Sailing                1232
Shooting               1228
Canoeing               1165
Basketball             1080
Handball               1060
Water Polo             1057
Volleyball              969
Equestrianism           965
Boxing                  944
Cross Country Skiing    776
Weightlifting           646
Speed Skating           580
Judo                    547
Alpine Skiing           428
Diving                  427
ax = sports_count.plot.bar(y='Sport')
ax.get_legend().remove()
year_count_df=data_df['Year'].value_counts().to_frame()
year_count_df.sort_index(inplace=True)
ax = year_count_df.plot.bar(y='Year')
ax.get_legend().remove()
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
titanic_data_df = pd.read_csv('titanic-data.csv')
g = sns.countplot(x='Sex', hue='Survived', data=titanic_data_df)
g = sns.catplot(x="Embarked", col="Survived",
data=titanic_data_df, kind="count",
height=4, aspect=.7);
g = sns.countplot(x='Embarked', hue='Survived', data=titanic_data_df)
g = sns.countplot(x='Embarked', hue='Pclass', data=titanic_data_df)
g = sns.countplot(x='Pclass', hue='Survived', data=titanic_data_df)
I will add a new column, 'FamilySize', computed as SibSp + Parch + 1: the number of siblings/spouses aboard, plus the number of parents/children aboard, plus the passenger.
# Function to add the new column 'FamilySize' (modifies df in place and returns it)
def add_family(df):
    df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
    return df
titanic_data_df = add_family(titanic_data_df)
titanic_data_df.head(10)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | FamilySize |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 2 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 2 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 2 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 1 |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q | 1 |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S | 1 |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S | 5 |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S | 3 |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C | 2 |
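As a quick check of the FamilySize arithmetic, the sketch below recomputes it for four rows copied from the table above (names shortened). It uses `DataFrame.assign`, which returns a new frame instead of modifying its input; that is a variation chosen for illustration, not the approach taken in `add_family`.

```python
import pandas as pd

# SibSp and Parch values copied from rows 0, 2, 7 and 8 of the table above
sample = pd.DataFrame({
    'Name': ['Braund', 'Heikkinen', 'Palsson', 'Johnson'],
    'SibSp': [1, 0, 3, 0],
    'Parch': [0, 0, 1, 2],
})

# assign() builds a new DataFrame; `sample` itself is left untouched
with_family = sample.assign(FamilySize=sample['SibSp'] + sample['Parch'] + 1)
print(with_family)
```

The computed values (2, 1, 5, 3) match the FamilySize column shown for those passengers.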
g = sns.countplot(x="FamilySize", hue="Survived",
data=titanic_data_df);
g = sns.countplot(x="FamilySize", hue="Sex",
data=titanic_data_df);
# Work on an explicit copy so that adding a column does not raise SettingWithCopyWarning
age_df = titanic_data_df[~titanic_data_df['Age'].isnull()].copy()
# Bin passengers into 10-year age groups and store the labels in a new column, 'ageGroup'
age_bins = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79']
age_df['ageGroup'] = pd.cut(age_df['Age'], range(0, 81, 10), right=False, labels=age_bins)
#age_df[['Age', 'ageGroup']]
sns.countplot(x='ageGroup', hue='Survived', data=age_df)
<matplotlib.axes._subplots.AxesSubplot at 0x7fd1996c6940>
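The binning step relies on `pd.cut` with `right=False`, which makes every bin closed on the left and open on the right: [0, 10), [10, 20), and so on. The sketch below runs the same edges and labels on a handful of made-up ages that sit on or near the bin boundaries; `toy_ages` is hypothetical and not taken from the Titanic data.

```python
import pandas as pd

# Made-up ages chosen to sit on and around the bin edges
toy_ages = pd.Series([0, 9, 10, 29.5, 79, 80])

age_labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79']
bin_edges = list(range(0, 81, 10))  # 0, 10, ..., 80

# right=False: each bin includes its left edge and excludes its right edge
toy_groups = pd.cut(toy_ages, bin_edges, right=False, labels=age_labels)
print(pd.DataFrame({'Age': toy_ages, 'ageGroup': toy_groups}))
```

One consequence worth keeping in mind: because the last bin is [70, 80), an age of exactly 80 receives no label and shows up as NaN, so such a passenger would be silently left out of the count plot.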
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
pokemon_df = pd.read_csv('Pokemon.csv', index_col=0)
pokemon_df.head()
| # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
pokemon_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 800 entries, 1 to 721
Data columns (total 12 columns):
Name          800 non-null object
Type 1        800 non-null object
Type 2        414 non-null object
Total         800 non-null int64
HP            800 non-null int64
Attack        800 non-null int64
Defense       800 non-null int64
Sp. Atk       800 non-null int64
Sp. Def       800 non-null int64
Speed         800 non-null int64
Generation    800 non-null int64
Legendary     800 non-null bool
dtypes: bool(1), int64(8), object(3)
memory usage: 75.8+ KB
pokemon_df.describe()
| | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation |
|---|---|---|---|---|---|---|---|---|
| count | 800.00000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.000000 | 800.00000 |
| mean | 435.10250 | 69.258750 | 79.001250 | 73.842500 | 72.820000 | 71.902500 | 68.277500 | 3.32375 |
| std | 119.96304 | 25.534669 | 32.457366 | 31.183501 | 32.722294 | 27.828916 | 29.060474 | 1.66129 |
| min | 180.00000 | 1.000000 | 5.000000 | 5.000000 | 10.000000 | 20.000000 | 5.000000 | 1.00000 |
| 25% | 330.00000 | 50.000000 | 55.000000 | 50.000000 | 49.750000 | 50.000000 | 45.000000 | 2.00000 |
| 50% | 450.00000 | 65.000000 | 75.000000 | 70.000000 | 65.000000 | 70.000000 | 65.000000 | 3.00000 |
| 75% | 515.00000 | 80.000000 | 100.000000 | 90.000000 | 95.000000 | 90.000000 | 90.000000 | 5.00000 |
| max | 780.00000 | 255.000000 | 190.000000 | 230.000000 | 194.000000 | 230.000000 | 180.000000 | 6.00000 |
# Assign the result back instead of calling fillna(inplace=True) on a selected column
pokemon_df['Type 2'] = pokemon_df['Type 2'].fillna(value='NA')
pokemon_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 800 entries, 1 to 721
Data columns (total 12 columns):
Name          800 non-null object
Type 1        800 non-null object
Type 2        800 non-null object
Total         800 non-null int64
HP            800 non-null int64
Attack        800 non-null int64
Defense       800 non-null int64
Sp. Atk       800 non-null int64
Sp. Def       800 non-null int64
Speed         800 non-null int64
Generation    800 non-null int64
Legendary     800 non-null bool
dtypes: bool(1), int64(8), object(3)
memory usage: 75.8+ KB
legendry_df = pokemon_df[pokemon_df['Legendary']==True]
ax = sns.countplot(x='Type 1', data=pokemon_df)
g= ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
ax = sns.countplot(x='Type 2', data=pokemon_df[pokemon_df['Type 2']!='NA'])
g= ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
pokemon_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 800 entries, 1 to 721
Data columns (total 12 columns):
Name          800 non-null object
Type 1        800 non-null object
Type 2        800 non-null object
Total         800 non-null int64
HP            800 non-null int64
Attack        800 non-null int64
Defense       800 non-null int64
Sp. Atk       800 non-null int64
Sp. Def       800 non-null int64
Speed         800 non-null int64
Generation    800 non-null int64
Legendary     800 non-null bool
dtypes: bool(1), int64(8), object(3)
memory usage: 95.8+ KB
plt.subplots(figsize = (20,5))
plt.title('Attack by Type1')
sns.boxplot(x = "Type 1", y = "Attack",data = pokemon_df)
<matplotlib.axes._subplots.AxesSubplot at 0x7f0e80ea1240>
plt.subplots(figsize = (20,5))
plt.title('Attack by Type2')
sns.boxplot(x = "Type 2", y = "Attack",data = pokemon_df)
<matplotlib.axes._subplots.AxesSubplot at 0x7f0e8098b208>
plt.subplots(figsize = (20,5))
plt.title('Defense by Type1')
sns.boxplot(x = "Type 1", y = "Defense",data = pokemon_df)
<matplotlib.axes._subplots.AxesSubplot at 0x7f0e88968198>
plt.subplots(figsize = (20,5))
plt.title('Defense by Type2')
sns.boxplot(x = "Type 2", y = "Defense",data = pokemon_df)
<matplotlib.axes._subplots.AxesSubplot at 0x7f0e819622e8>
type_grouped = pokemon_df[pokemon_df['Type 2']!='NA'].groupby(['Type 1', 'Type 2']).size()
print(type_grouped)
Type 1 Type 2
Bug Electric 2
Fighting 2
Fire 2
Flying 14
Ghost 1
Grass 6
Ground 2
Poison 12
Rock 3
Steel 7
Water 1
Dark Dragon 3
Fighting 2
Fire 3
Flying 5
Ghost 2
Ice 2
Psychic 2
Steel 2
Dragon Electric 1
Fairy 1
Fire 1
Flying 6
Ground 5
Ice 3
Psychic 4
Electric Dragon 1
Fairy 1
Fire 1
Flying 5
..
Rock Fighting 1
Flying 4
Grass 2
Ground 6
Ice 2
Psychic 2
Steel 3
Water 6
Steel Dragon 1
Fairy 3
Fighting 1
Flying 1
Ghost 4
Ground 2
Psychic 7
Rock 3
Water Dark 6
Dragon 2
Electric 2
Fairy 2
Fighting 3
Flying 7
Ghost 2
Grass 3
Ground 10
Ice 3
Poison 3
Psychic 5
Rock 4
Steel 1
Length: 136, dtype: int64
sns.set(rc={'figure.figsize':(11,8)})
sns.heatmap(
type_grouped.unstack(),
annot=True,
)
plt.xticks(rotation=90)
plt.show()
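The heatmap input comes from two steps: `groupby(...).size()` counts each (Type 1, Type 2) pair, and `unstack()` pivots the inner index level into columns. A minimal sketch on a made-up frame shows the intermediate shapes; the `toy` data and the variable names are hypothetical.

```python
import pandas as pd

# Made-up type pairs, just enough to leave some combinations empty
toy = pd.DataFrame({
    'Type 1': ['Bug', 'Bug', 'Bug', 'Water', 'Water'],
    'Type 2': ['Flying', 'Flying', 'Poison', 'Flying', 'Ice'],
})

# Series indexed by (Type 1, Type 2) holding the number of rows per pair
pair_counts = toy.groupby(['Type 1', 'Type 2']).size()

# Pivot 'Type 2' into columns; pairs that never occur become NaN
grid = pair_counts.unstack()

# Same pivot, but with missing pairs shown as 0 instead of NaN
grid_filled = pair_counts.unstack(fill_value=0)

print(grid)
print(grid_filled)
```

With the default `unstack()`, combinations that never occur stay NaN and are drawn as blank cells in the heatmap; passing `fill_value=0` would annotate them with 0 instead. Which reads better is a presentation choice.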
type_grouped = legendry_df[legendry_df['Type 2']!='NA'].groupby(['Type 1', 'Type 2']).size()
sns.set(rc={'figure.figsize':(11,8)})
sns.heatmap(
type_grouped.unstack(),
annot=True,
)
plt.xticks(rotation=90)
plt.show()
pokemon_gen = legendry_df.groupby('Generation')['Name'].count()
sns.lineplot(data=pokemon_gen)
<matplotlib.axes._subplots.AxesSubplot at 0x7f0e7bed1ac8>
legendry_df[legendry_df['Generation']==3]
| # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 377 | Regirock | Rock | NA | 580 | 80 | 100 | 200 | 50 | 100 | 50 | 3 | True |
| 378 | Regice | Ice | NA | 580 | 80 | 50 | 100 | 100 | 200 | 50 | 3 | True |
| 379 | Registeel | Steel | NA | 580 | 80 | 75 | 150 | 75 | 150 | 50 | 3 | True |
| 380 | Latias | Dragon | Psychic | 600 | 80 | 80 | 90 | 110 | 130 | 110 | 3 | True |
| 380 | LatiasMega Latias | Dragon | Psychic | 700 | 80 | 100 | 120 | 140 | 150 | 110 | 3 | True |
| 381 | Latios | Dragon | Psychic | 600 | 80 | 90 | 80 | 130 | 110 | 110 | 3 | True |
| 381 | LatiosMega Latios | Dragon | Psychic | 700 | 80 | 130 | 100 | 160 | 120 | 110 | 3 | True |
| 382 | Kyogre | Water | NA | 670 | 100 | 100 | 90 | 150 | 140 | 90 | 3 | True |
| 382 | KyogrePrimal Kyogre | Water | NA | 770 | 100 | 150 | 90 | 180 | 160 | 90 | 3 | True |
| 383 | Groudon | Ground | NA | 670 | 100 | 150 | 140 | 100 | 90 | 90 | 3 | True |
| 383 | GroudonPrimal Groudon | Ground | Fire | 770 | 100 | 180 | 160 | 150 | 90 | 90 | 3 | True |
| 384 | Rayquaza | Dragon | Flying | 680 | 105 | 150 | 90 | 150 | 90 | 95 | 3 | True |
| 384 | RayquazaMega Rayquaza | Dragon | Flying | 780 | 105 | 180 | 100 | 180 | 100 | 115 | 3 | True |
| 385 | Jirachi | Steel | Psychic | 600 | 100 | 100 | 100 | 100 | 100 | 100 | 3 | True |
| 386 | DeoxysNormal Forme | Psychic | NA | 600 | 50 | 150 | 50 | 150 | 50 | 150 | 3 | True |
| 386 | DeoxysAttack Forme | Psychic | NA | 600 | 50 | 180 | 20 | 180 | 20 | 150 | 3 | True |
| 386 | DeoxysDefense Forme | Psychic | NA | 600 | 50 | 70 | 160 | 70 | 160 | 90 | 3 | True |
| 386 | DeoxysSpeed Forme | Psychic | NA | 600 | 50 | 95 | 90 | 95 | 90 | 180 | 3 | True |
legendry_df[legendry_df['Generation']==5]
| # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 494 | Victini | Psychic | Fire | 600 | 100 | 100 | 100 | 100 | 100 | 100 | 5 | True |
| 638 | Cobalion | Steel | Fighting | 580 | 91 | 90 | 129 | 90 | 72 | 108 | 5 | True |
| 639 | Terrakion | Rock | Fighting | 580 | 91 | 129 | 90 | 72 | 90 | 108 | 5 | True |
| 640 | Virizion | Grass | Fighting | 580 | 91 | 90 | 72 | 90 | 129 | 108 | 5 | True |
| 641 | TornadusIncarnate Forme | Flying | NA | 580 | 79 | 115 | 70 | 125 | 80 | 111 | 5 | True |
| 641 | TornadusTherian Forme | Flying | NA | 580 | 79 | 100 | 80 | 110 | 90 | 121 | 5 | True |
| 642 | ThundurusIncarnate Forme | Electric | Flying | 580 | 79 | 115 | 70 | 125 | 80 | 111 | 5 | True |
| 642 | ThundurusTherian Forme | Electric | Flying | 580 | 79 | 105 | 70 | 145 | 80 | 101 | 5 | True |
| 643 | Reshiram | Dragon | Fire | 680 | 100 | 120 | 100 | 150 | 120 | 90 | 5 | True |
| 644 | Zekrom | Dragon | Electric | 680 | 100 | 150 | 120 | 120 | 100 | 90 | 5 | True |
| 645 | LandorusIncarnate Forme | Ground | Flying | 600 | 89 | 125 | 90 | 115 | 80 | 101 | 5 | True |
| 645 | LandorusTherian Forme | Ground | Flying | 600 | 89 | 145 | 90 | 105 | 80 | 91 | 5 | True |
| 646 | Kyurem | Dragon | Ice | 660 | 125 | 130 | 90 | 130 | 90 | 95 | 5 | True |
| 646 | KyuremBlack Kyurem | Dragon | Ice | 700 | 125 | 170 | 100 | 120 | 90 | 95 | 5 | True |
| 646 | KyuremWhite Kyurem | Dragon | Ice | 700 | 125 | 120 | 90 | 170 | 100 | 95 | 5 | True |
max_type1_per_gen = pokemon_df.groupby(['Generation','Type 1']).size()
max_type1_per_gen.unstack().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7f0e8426f5c0>
# Count Pokemon per (Generation, Type 1) and flatten the result into plain columns for plotting
type1_per_gen = pd.DataFrame({'count' : pokemon_df.groupby( [ "Generation", "Type 1"] ).size()}).reset_index()
# print(pokemon_df.groupby( [ "Generation", "Type 1"] ).size())
from bokeh.palettes import Spectral11
from bokeh.plotting import figure, output_file, show
from bokeh.models import Legend, LegendItem
p = figure(plot_width=800, plot_height=550, x_range=(1, 7))
p.background_fill_color = "beige"
p.title.text = 'Click on legend entries to hide the corresponding lines'
import random
legend_list = []
# Draw one line per primary type; colors are picked at random and may repeat
for type_id in type1_per_gen['Type 1'].unique():
    color = random.choice(Spectral11)
    df = pd.DataFrame(type1_per_gen[type1_per_gen['Type 1']==type_id])
    p.line(df['Generation'], df['count'], line_width=2, alpha=0.8, color=color, legend=type_id)
p.legend.location = "top_right"
p.legend.click_policy="hide"
show(p)
pokemon_df.groupby([ "Generation", "Type 1"])[['Total']].max()
| Generation | Type 1 | Total |
|---|---|---|
| 1 | Bug | 600 |
| | Dragon | 600 |
| | Electric | 580 |
| | Fairy | 483 |
| | Fighting | 505 |
| | Fire | 634 |
| | Ghost | 600 |
| | Grass | 625 |
| | Ground | 485 |
| | Ice | 580 |
| | Normal | 590 |
| | Poison | 505 |
| | Psychic | 780 |
| | Rock | 615 |
| | Water | 640 |
| 2 | Bug | 600 |
| | Dark | 600 |
| | Electric | 610 |
| | Fairy | 450 |
| | Fighting | 455 |
| | Fire | 680 |
| | Ghost | 435 |
| | Grass | 525 |
| | Ground | 500 |
| | Ice | 450 |
| | Normal | 540 |
| | Poison | 535 |
| | Psychic | 680 |
| | Rock | 700 |
| | Steel | 610 |
| ... | ... | ... |
| 5 | Fighting | 510 |
| | Fire | 540 |
| | Flying | 580 |
| | Ghost | 520 |
| | Grass | 580 |
| | Ground | 600 |
| | Ice | 535 |
| | Normal | 600 |
| | Poison | 474 |
| | Psychic | 600 |
| | Rock | 580 |
| | Steel | 580 |
| | Water | 580 |
| 6 | Bug | 411 |
| | Dark | 680 |
| | Dragon | 600 |
| | Electric | 481 |
| | Fairy | 680 |
| | Fighting | 500 |
| | Fire | 600 |
| | Flying | 535 |
| | Ghost | 494 |
| | Grass | 531 |
| | Ice | 514 |
| | Normal | 472 |
| | Poison | 494 |
| | Psychic | 680 |
| | Rock | 700 |
| | Steel | 520 |
| | Water | 530 |
98 rows × 1 columns
# Highest 'Total' per (Generation, Type 1), flattened into plain columns for plotting
type1_total_gen = pd.DataFrame({'Total' : pokemon_df.groupby( [ "Generation", "Type 1"] )['Total'].max()}).reset_index()
# print(pokemon_df.groupby( [ "Generation", "Type 1"] )['Total'].max())
from bokeh.palettes import Spectral11
from bokeh.plotting import figure, output_file, show
from bokeh.models import Legend, LegendItem
p = figure(plot_width=800, plot_height=550, x_range=(1, 7))
p.background_fill_color = "beige"
p.title.text = 'Click on legend entries to hide the corresponding lines'
import random
legend_list = []
# Draw one line per primary type; colors are picked at random and may repeat
for type_id in type1_total_gen['Type 1'].unique():
    color = random.choice(Spectral11)
    df = pd.DataFrame(type1_total_gen[type1_total_gen['Type 1']==type_id])
    p.line(df['Generation'], df['Total'], line_width=2, alpha=0.8, color=color, legend=type_id)
p.legend.location = "top_right"
p.legend.click_policy="hide"
show(p)
Download Link : https://archive.ics.uci.edu/ml/datasets/wine+quality
Citation : P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
red_wine_df = pd.read_csv('winequality-red.csv', delimiter=';')
red_wine_df.head()
| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
red_wine_df.columns
Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
'pH', 'sulphates', 'alcohol', 'quality'],
dtype='object')
red_wine_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
fixed acidity           1599 non-null float64
volatile acidity        1599 non-null float64
citric acid             1599 non-null float64
residual sugar          1599 non-null float64
chlorides               1599 non-null float64
free sulfur dioxide     1599 non-null float64
total sulfur dioxide    1599 non-null float64
density                 1599 non-null float64
pH                      1599 non-null float64
sulphates               1599 non-null float64
alcohol                 1599 non-null float64
quality                 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
red_wine_df.describe()
| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 |
| mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
| std | 1.741096 | 0.179060 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
| min | 4.600000 | 0.120000 | 0.000000 | 0.900000 | 0.012000 | 1.000000 | 6.000000 | 0.990070 | 2.740000 | 0.330000 | 8.400000 | 3.000000 |
| 25% | 7.100000 | 0.390000 | 0.090000 | 1.900000 | 0.070000 | 7.000000 | 22.000000 | 0.995600 | 3.210000 | 0.550000 | 9.500000 | 5.000000 |
| 50% | 7.900000 | 0.520000 | 0.260000 | 2.200000 | 0.079000 | 14.000000 | 38.000000 | 0.996750 | 3.310000 | 0.620000 | 10.200000 | 6.000000 |
| 75% | 9.200000 | 0.640000 | 0.420000 | 2.600000 | 0.090000 | 21.000000 | 62.000000 | 0.997835 | 3.400000 | 0.730000 | 11.100000 | 6.000000 |
| max | 15.900000 | 1.580000 | 1.000000 | 15.500000 | 0.611000 | 72.000000 | 289.000000 | 1.003690 | 4.010000 | 2.000000 | 14.900000 | 8.000000 |
red_wine_df['quality']
0 5
1 5
2 5
3 6
4 5
5 5
6 5
7 7
8 7
9 5
10 5
11 5
12 5
13 5
14 5
15 5
16 7
17 5
18 4
19 6
20 6
21 5
22 5
23 5
24 6
25 5
26 5
27 5
28 5
29 6
..
1569 6
1570 6
1571 6
1572 5
1573 6
1574 6
1575 6
1576 6
1577 6
1578 6
1579 5
1580 6
1581 5
1582 5
1583 5
1584 7
1585 6
1586 6
1587 6
1588 6
1589 5
1590 6
1591 6
1592 6
1593 6
1594 5
1595 6
1596 6
1597 5
1598 6
Name: quality, Length: 1599, dtype: int64
sns.set(rc={'figure.figsize':(7,6)})
sns.countplot(x='quality', data=red_wine_df)
<matplotlib.axes._subplots.AxesSubplot at 0x7fdf97f15eb8>
sns.pairplot(red_wine_df)
<seaborn.axisgrid.PairGrid at 0x7fdf97f29be0>
sns.set(rc={'figure.figsize':(12,10)})
sns.heatmap(red_wine_df.corr(), annot=True, fmt='.2f', linewidths=2)
<matplotlib.axes._subplots.AxesSubplot at 0x7fdf94860390>
sns.distplot(red_wine_df['alcohol'])
<matplotlib.axes._subplots.AxesSubplot at 0x7fdf88fbc588>
from scipy.stats import skew
skew(red_wine_df['alcohol'])
0.8600210646566755
def draw_hist(temp_df, bin_size = 15):
    ax = sns.distplot(temp_df)
    #xmin, xmax = ax.get_xlim()
    #ax.set_xticks(np.round(np.linspace(xmin, xmax, bin_size), 2))
    plt.tight_layout()
    plt.locator_params(axis='y', nbins=6)
    plt.show()
    print("Skewness is {}".format(skew(temp_df)))
    # Each label now matches the statistic it reports
    print("Mean is {}".format(np.mean(temp_df)))
    print("Median is {}".format(np.median(temp_df)))
draw_hist(red_wine_df['alcohol'])
Skewness is 0.8600210646566755
Mean is 10.422983114446502
Median is 10.2
sns.boxplot(x='quality', y='alcohol', data=red_wine_df)
<matplotlib.axes._subplots.AxesSubplot at 0x7fdf88a28518>
sns.boxplot(x='quality', y='alcohol', data=red_wine_df,
showfliers=False)
<matplotlib.axes._subplots.AxesSubplot at 0x7fdf87abf5f8>
joint_plt = sns.jointplot(x='alcohol', y='pH', data=red_wine_df,
kind='reg')
from scipy.stats import pearsonr
def get_corr(col1, col2, temp_df):
    # pearsonr returns the correlation coefficient and its two-sided p-value
    pearson_corr, p_value = pearsonr(temp_df[col1], temp_df[col2])
    print("Correlation between {} and {} is {}".format(col1, col2, pearson_corr))
    print("P-value of this correlation is {}".format(p_value))
get_corr('alcohol', 'pH', red_wine_df)
Correlation between alcohol and pH is 0.20563250850549822
P-value of this correlation is 9.964497741460977e-17
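The first value returned by `pearsonr` is the Pearson correlation coefficient: the covariance of the two columns divided by the product of their standard deviations. The sketch below computes it from that definition on toy data; `pearson_r` and the sample lists are illustrative assumptions, and the p-value (which tests the null hypothesis of no linear correlation) is not reproduced here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance / (std_x * std_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # The 1/n factors cancel between numerator and denominator
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy columns
toy_alcohol = [9.4, 9.8, 10.5, 11.2, 12.8]
toy_ph = [3.20, 3.31, 3.28, 3.40, 3.45]

print(pearson_r(toy_alcohol, toy_ph))
```

The coefficient always lies between -1 and 1. A value near 0.2, as for alcohol and pH above, is a weak positive linear relationship even when the p-value is very small, because a large sample makes weak effects statistically detectable.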
joint_plt = sns.jointplot(x='alcohol', y='density', data=red_wine_df,
kind='reg')
get_corr('alcohol', 'density', red_wine_df)
Correlation between alcohol and density is -0.4961797702417011
P-value of this correlation is 3.938835339991827e-100
g = sns.FacetGrid(red_wine_df, col="quality")
g = g.map(sns.regplot, "density", "alcohol")
sns.boxplot(x='quality', y='sulphates', data=red_wine_df)
<matplotlib.axes._subplots.AxesSubplot at 0x7fdf861a4f60>
sns.boxplot(x='quality', y='total sulfur dioxide', data=red_wine_df)
<matplotlib.axes._subplots.AxesSubplot at 0x7fdf860d5668>
sns.boxplot(x='quality', y='free sulfur dioxide', data=red_wine_df)
<matplotlib.axes._subplots.AxesSubplot at 0x7fdf860545f8>
red_wine_df.columns
sns.boxplot(x='quality', y='fixed acidity', data=red_wine_df)
<matplotlib.axes._subplots.AxesSubplot at 0x7fdf8618f668>
sns.boxplot(x='quality', y='citric acid', data=red_wine_df)
<matplotlib.axes._subplots.AxesSubplot at 0x7fdf85e49ac8>
sns.boxplot(x='quality', y='volatile acidity', data=red_wine_df)
<matplotlib.axes._subplots.AxesSubplot at 0x7fdf85d8eeb8>
red_wine_df.columns
Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
'pH', 'sulphates', 'alcohol', 'quality'],
dtype='object')
get_corr('pH', 'citric acid', red_wine_df)
red_wine_df['total acidity'] = (red_wine_df['fixed acidity']
                                + red_wine_df['citric acid']
                                + red_wine_df['volatile acidity'])
sns.boxplot(x='quality', y='total acidity', data=red_wine_df,
showfliers=False)
<matplotlib.axes._subplots.AxesSubplot at 0x7fdf8572b9b0>
sns.regplot(x='pH', y='total acidity', data=red_wine_df)
<matplotlib.axes._subplots.AxesSubplot at 0x7fdf856a1ac8>
g = sns.FacetGrid(red_wine_df, col="quality")
g = g.map(sns.regplot, "total acidity", "pH")
get_corr('total acidity', 'pH', red_wine_df)
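`get_corr` reports one correlation for the whole table, while the `FacetGrid` shows a separate regression for each quality level. As a hedged sketch of how the per-quality correlations could be computed numerically, the example below groups a small made-up frame by `quality` and calls `Series.corr` within each group. The frame `toy_df` and its values are hypothetical and only mirror the column names used here.

```python
import pandas as pd

# Hypothetical toy frame reusing the notebook's column names
toy_df = pd.DataFrame({
    "quality": [5, 5, 5, 6, 6, 6],
    "total acidity": [8.0, 9.0, 10.0, 7.0, 8.0, 9.0],
    "pH": [3.5, 3.4, 3.3, 3.6, 3.4, 3.5],
})

# One Pearson correlation per quality level
per_quality_corr = {
    quality: group["total acidity"].corr(group["pH"])
    for quality, group in toy_df.groupby("quality")
}
print(per_quality_corr)
```

Comparing these values across groups indicates whether the relationship between acidity and pH changes with quality, which is hard to judge from the facet plots alone.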
g = sns.FacetGrid(red_wine_df, col="quality")
g = g.map(sns.regplot, "free sulfur dioxide", "pH")